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The  optimization  problem  of  minimizing  a  linear  objective  function  over  a  general 
convex  body  given  only  by  a  weak  membership  oracle  is  a  central  problem  in 
convex  optimization.  Traditional  methods  require  the  (expensive)  construction 
of  separating  relations  (cuts)  or  gradient  information  (which  is  often  expensive). 
We  demonstrate  how  a  sampling  procedure  can  be  used  as  the  central  routine  in 
a  randomized  polynomial  time  algorithm  for  approximately  minimizing  a  linear 
objective  function  over  an  up-monotone  convex  set  presented  by  a  membership 
oracle.  The  sampling  procedure  is  a  Markov  chain  that  uses  only  local  membership 
tests.  We  further  demonstrate  a  direct  application  of  this  technique  to  an  important 
stochastic  optimization  problem  called  “component  commonality.” 

A  second  application  of  the  above  sampling  scheme  is  a  statistical  hypothesis 
testing  procedure  involving  contingency  tables.  The  test  involves  counting  con¬ 
tingency  tables  (matrices  of  non-negative  integers)  obeying  given  row  and  column 
constraints  which  is  known  to  be  fP  hard.  Under  fairly  natural  assumptions  the 
Markov  technique  yields  an  effective  procedure  for  approximating,  to  any  de¬ 
sired  accuracy,  the  necessary  counts.  We  also  develop  a  powerful  exact  counting 
scheme,  for  fixed  dimension,  and  use  it  to  validate  the  random  technique. 
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Chapter  1 
Introduction 


This  thesis  applies  samplers  based  on  rapidly  mixing  Markov  chains  to  both  convex 
optimization  and  counting  contingency  tables.  A  sampler,  S,  is  a  randomized 
algorithm  (or  random  variable)  that,  given  a  description  of  a  feasible  set  and 
probability  measure  /i  on  the  feasible  set,  generates  a  feasible  element  such  that 
for  any  non-negligible  /i -measurable  set  1/,  Pr  [5  G  V]  «  )w(V).  This  thesis  is 
primarily  concerned  with  samplers  that  generate  points  from  convex  bodies  in  R" 
which  are  presented  by  membership  oracles.*  [23, 21,  5,  40] 

The  first  part  of  this  thesis  is  joint  work  with  Ravi  Karman  and  Sridhar  Tayur 
and  is  a  randomized  polynomial  time  algorithm  to  approximately  minimize  a  linear 
function  c  over  an  up-monotone  convex  set,  K,  (i.e.,  a  convex  set  with  the  property 
that  if  X  belongs  to  the  set  and  y  is  component  wise  greater  than  or  equal  to  x, 
then  y  belongs  to  the  set  also)  in  the  positive  orthant  given  by  a  membership 
oracle.[39]  By  “membership  oracle”  we  mean  a  black  box  procedure  that  can 
tell  us  if  a;  G  K.  It  does  not  supply  any  additional  structural  information.  It  is 
also  assumed  that  an  initial  point  x-f  G  K  is  known.  This  model  is  motivated 
by  a  stochastic  optimization  problem  called  component  commonality  [37]  where 
membership  can  be  tested  by  performing  a  numerical  integration,  but  separating 
hyperplanes  or  even  gradient  directions  are  expensive  to  obtain. 

Our  approach  is  to  optimize  using  only  local  information.  We  present  a 
non-uniform  sampler  for  R”  that  obeys  a  probability  distribution  that  grows  ge¬ 
ometrically  in  the  direction  of  the  objective  function  and  falls  off  geometrically 
in  distance  away  from  the  convex  body.  The  central  technique  is  to  design  the 

'Actually  the  intersection  of  a  very  fine  lattice  in  R"  with  a  convex  body. 
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distribution  to  be  smooth  enough  that  the  sampler  can  converge  in  polynomial 
time. 

The  second  application  of  this  thesis  is  joint  work  with  Martin  Dyer  and  Ravi 
Kannan  and  is  a  near-uniform  sampler  and  an  approximate  counting  scheme  for 
two  way  contingency  tables  that  simultaneously  obey  row  and  column  constraints. 
A  two  way  contingency  table  is  a  matrix  of  non-negative  integers.  These  tables  are 
often  used  to  represent  how  many  individuals  have  pairs  of  attributes  (i.e.  rows  are 
indexed  by  one  attribute,  columns  by  the  other)  in  a  population.  [2]  Diaconis  and 
Efron  [16]  have  proposed  a  significance  test  and  a  model  fit  that  can  be  performed 
with  the  aid  of  a  near-uniform  sampler  for  tables  that  obey  given  row  and  column 
constraints. 

We  show  how,  given  eertain  assumptions,  simple  geometry  can  be  used  to 
convert  a  near-uniform  convex  body  sampler  to  a  near-uniform  contingency  table 
sampler.  That  is:  given  certain  easily  met  conditions  on  the  row  and  column 
totals  our  continuous  techniques  can  solve  this  discrete  problem.  An  approximate 
counting  scheme  is  designed  using  this  sampler. 

We  also  develop,  with  Bemd  Sturmfels,  an  effective  exact  coimting  scheme 
for  fixed  dimension  contingency  tables.  This  scheme  helped  validate  the  random 
results  and  allowed  us  to  analyze  the  asymptotic  accuracy  of  some  estimation 
schemes  in  the  literature. 


1.1  Convex  optimization. 

In  principle  any  convex  optimization  problem  can  by  solved  in  polynomial  time 
by  the  Ellipsoid  Algorithm.  One  can  also  see  Vaidya[53]  for  a  state  of  the  art 
general  convex  programming  algorithm.  However,  these  approaches  all  require 
a  separation  oracle.  A  technique  described  by  Yudin  and  Nemirovskii  can  be 
used  to  convert  a  membership  oracle  to  a  separation  oracle  [33],  but  this  is  quite 
expensive.  Previous  efforts  to  solve  component  commonality  have  used  non-linear 
programming  methods,  but  these  require  that  gradients  be  calculated,  which  can 
be  as  expensive  as  computing  separating  cuts. 

We  solve  the  following  more  general  problem:  optimize  a  linear  function  over 
an  up-monotone  set  presented  by  a  membership  oracle.  A  set  K  C  R”  is  called 
up-monotone  \f  x  ^  K  and  y  >  x  y  ^  K.  This  property  holds  for  linear 
programs  of  the  form  Ax  >  b  when  the  matrix  A  is  non-negative.  Many  supply 
and  covering  problems  have  this  form.  We  do  not  require  that  K  be  bounded 
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but  require,  with  out  loss  of  generality,  that  our  objective  function,  c,  be  strictly 
positive  and  K  be  contained  in  the  non-negative  orthant  of  R".  Thus  for  any  d  we 
have  the  set  /T  fl  {a;  €  R”  |  c  •  a;  <  d}  is  bounded. 

The  component  commonality  problem  is  a  stochastic  or  chance  constraint 
optimization  problem.  Good  sources  for  this  problem  are  [54]  and  [37].  The 
informal  description  is  as  follows.  Every  month  a  factory  receives  orders  for  its 
m  products.  The  total  orders  for  a  month  are  considered  a  vector  in  R™  (we  will 
not  deal  with  the  integral  nature  of  this  problem  at  this  time)  distributed  according 
to  some  log-concave  density.^  We  assume  that  we  can  evaluate  F{d)  at  any  point 
d  e  R"* .  This  is  enough  to  allow  us  to  numericly  integrate  F  over  various  regions. 
We  also  have  an  x  m  non-negative  matrix  A  in  which  we  interpret  Aa,b  as  the 
amount  of  component  a  used  in  the  production  of  one  unit  of  product  b.  Our  goal  is 
to  find  a  vector  a;  €  R”  that  is  a  stocking  level  of  components  so  we  have  Ad  <  x 
with  probability  at  least  a  when  d  is  drawn  from  the  order  distribution.  That  is  if 
we  stocked  our  warehouse  according  to  the  plan  x  we  would  be  able  to  fill  all  our 
orders  with  probability  at  least  a.  It  should  be  clean  that  component  commonality 
is  a  special  case  of  up-monotone  optimization. 


1.2  Contingency  tables. 

Diaconis  and  Efron  [16]  have  developed  a  method  of  modeling  distributions  that 
allows  one  to  make  some  qualitative  and  quantitative  statements  about  an  unknown 
distribution.  This  technique  is  potentially  much  more  useful  than  the  usual  task  of 
proving  the  unknown  distribution  is  not  various  test  distributions.  Their  scheme 
requires  several  significance  tests  which  come  down  to  counting  contingency 
tables.  Unfortunately  the  counting  problem  is  difficult  (tiP  hard)  even  for  2  x  n 
tables. 

An  example  (adapted  from  [17])  is  in  order  (this  is  example  should  not  be 
construed  as  being  the  only  application). 

The  table  1.1  is  from  Snee  [49].  A  uniform  generator  of  contingency  tables 
could  be  used  to  implement  many  different  statistical  tests,  we  give  an  example 
here  chosen  for  simplicity  (rather  than  power  or  generality).  Suppose  a  census 

density,  F,  is  log-concave  if  log(i^)  is  concave.  A  density,  G,  is  concave  if  G{\x  -I-  (1  — 
F)y)  >  XG{x)  -f  (1  -  F)G{y)  for  all  x,  y  and  0  <  A  <  1.  So  F  is  log-concave  if  and  only  if 
F{Xx  +  (1  —  X)y)  >  F{x)^F(yy~^  with  x,y,X  as  above.  Any  positive  concave  function  is 
log-concave. 
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Eye  Color 

Hair  Color 

Total 

Black 

Brunette 

Red 

Blond 

Brown 

68 

119 

26 

7 

220 

Blue 

20 

84 

17 

94 

215 

Hazel 

15 

54 

14 

10 

93 

Green 

5 

29 

14 

16 

64 

Total 

108 

286 

71 

127 

592 

Table  1.1;  Eyev.s.  Hair  Color,  observed. 


Eye  Color 

Hair  Color 

Total 

Black 

Brunette 

Red 

Blond 

Brown 

40 

106 

27 

47 

220 

Blue 

39 

104 

26 

46 

215 

Hazel 

17 

45 

11 

20 

93 

Green 

12 

31 

7 

14 

64 

Total 

108 

286 

71 

127 

592 

Table  1.2:  Eye  v.s.  Hair  Color,  nearly  independent. 


were  performed  where  50  different  statistics  (hair  color,  eye  color,  income  level, 
...)  each  divided  into  discrete  categories  (Black,  Brunette,  Red,  Blond;  Brown, 
Blue,  Hazel,  Green;  0  to  6999  krona,  7000  to  13999  krona  ...)  were  collected 
on  592  people.  A  plausible  task  for  a  statistician  would  be  to  identify  pairs  of 
traits  that  support  some  meaningful  relationship.  This  could  be  the  first  step  in  an 
epidemiological  study.  One  technique  would  be  to  examine  the  =  1225  two 
way  tables.  Table  1.1  could  be  one  such  pairing.  The  statistician  does  not  wish  to 
look  at  1225  two  way  tables-  so  an  automatic  technique  of  identifying  interesting 
ones  is  required. 

Let  A  be  m  by  n  a  table  and  let  Xij  and  Cj  =  YaL\  Xi,j-  ^  would 

be  considered  uninteresting  if  the  variables  were  independent,  or  equivalently 
Xij  fa  ViCj 1 592.  A  nearly  independent  table  would  look  like  table  1.2. 

Comparing  these  two  tables  one  might  conclude  that  people  with  brown  eyes 


8 


and  blond  hair  occur  less  often  in  the  observed  table  than  in  the  nearly  independent 
table.  The  question  is  not  if  we  have  discovered  a  relationship  between  hair  and  eye 
colors  in  the  table  at  hand,  but  if  the  relationship  observed  in  this  table  is  a  likely 
due  to  sampling  error.  One  could  compute  the  statistic  of  the  observed  table 
(which  is  a  measure  of  departure  from  independence)  and  get  a  value  of  138.29. 
The  question  then  becomes  is  138.29  a  typical  or  atypical  amount  of  deviation 
from  independence?  One  method  put  forward  by  Diaconis  and  Efron  to  determine 
this  is  to  compute  the  significance  level  of  the  observed  table  with  respect  to  the 
uniform  distribution.  This  is  defined  as  the  ratio  of  the  number  of  integer  matrices 
matching  our  row  and  column  totals  that  have  a  statistic  of  at  least  138.29  to  the 
total  number  of  integer  matrices  matching  our  row  and  column  totals.  That  would 
be  to  say  “of  all  the  possible  tables  with  the  same  disjoint  distributions  (ie.  with 
the  same  r*  and  Cj)  of  hair  and  eye  colors  what  proportion  have  ^  138.29”. 

For  the  table  in  question  asymptotic  and  volume  techniques  gave  “about 
10%”[16]  and  later  randomized  techniques  gave  16.3%  (with  no  confidence 
interval)  [20].  My  current  randomized  simulation  work  shows  that  the  true  value 
is  between  15.3  and  15.7  with  a  confidence  level  of  over  99.1%?  In  any  case, 
the  result  is  that  about  1  /6  of  all  possible  tables  with  the  same  row  and  column 
sums  look  more  independent  than  the  observed  one.  So  the  relationship  between 
eye  color  and  hair  color  observed  in  the  table  is  not  very  strong  and  would  not  be 
a  good  candidate  for  further  investigation."*  The  two  way  tables  could  be  ranked 
according  to  how  atypical  they  are  (the  significance  of  their  x^)  the  human 
statistician  could  then  be  presented  the  top  few  dozen  such  tables.^ 

With  a  counting  technique  (and  some  Bayesian  priors)  much  more  sophisticated 
analysis  can  be  performed. 


^This  should  illustrate  the  difficulty  encountered  analyzing  tables  as  small  as  4  x  4. 

'‘it  is  important  to  note  that  we  are  not  saying  that  the  relationship  doesn’t  exist  or  is  statistically 
insignificant.  It  is  also  important  to  remember  that  when  the  number  of  variables  is  commensurate 
with  the  sample  population  that  one  would  expect  many  statistically  significant  but  meaningless 
cross  correlations. 

^They  could  then  discover  a  relationship  like  “unusually  few  people  who  juggle  knives  drink 
every  day”  indicating  either  few  people  partake  in  both  or  the  combination  is  fatal. 
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1.3  Basic  Markov  chains. 


Here  we  will  define  the  Markov  chain  terminology  we  are  using  in  this  thesis. 
We  will  consider  a  Markov  chain  to  be  a  finite  directed  graph  G  =  (V,E)  and  a 
transition  function  P  :V  xV  such  that 

p{^,y)  ^  0  Vx,  y  6  y 

P{x,y)>0  y{x,y)eE 

P{x,y)  =  0  \/{x,y)^E' 

Eyev  P{x,y)  =  I  Vx  e  y 

Notethatthisdefinitionrequires(x,x)  €  Eif'Z,yev  y:^x  Pi^tV)  <  1-  We  interpret 
P{x,y)  as  the  probability  that  the  Markov  chain  will  be  in  state  y  at  time  t  +  1 
given  that  it  is  in  state  x  at  time  t  (or  P[Xt+i  =  y\Xt  =  x]).  We  will  also  consider 
our  Markov  chain  as  a  linear  system  where  Xt  will  be  a  vector  in  and  by  an 
abuse  of  notation  we  take  P  to  be  the  |y  |  x  |y  |  matrix  such  that  Px,y  =  P(x,  y). 
The  idea  is  that  if  Xt  is  a  vector  such  that  >  0  and  1  •  Xj  =  1  we  can  interpret 
(Xi)i,  as  P[Xt  =  v\,  the  probability  that  our  Markov  chain  is  in  state  v  at  time  t. 
The  Markov  chain  then  progresses  imder  the  simple  relation 

X*+i  =  XtP 

As  one  would  expect  we  are  immediately  concerned  about  the  eigenvalues  and 
eigenvectors  of  these  systems. 

We  will  not  treat  Markov  chains  in  their  full  generality,  but  discuss  only  the 
“nice”  chains  used  in  this  thesis.  For  a  much  more  general  treatment  one  should 
refer  to  [  1 2] .  “Nice”  involves  some  technical  definitions  from  the  theory  of  Markov 
chains,  but  we  will  define  all  the  terms  used  in  this  section.  The  chains  we  will 
use  will  be  chosen  to  be  irreducible,  aperiodic  and  time-reversible.  Each  of  these 
terms  has  an  interpretation  in  terms  of  stochastic  processes,  graph  theory  and  linear 
operators.  To  motivate  the  definitions  we  will  point  out  all  three  interpretations 
where  appropriate. 

A  finite  Markov  chain  is  called  irreducible  if  P[Xt  =  y  infinitely  often]  = 
for  all  all  y  ^  V.  This  means  that  the  chain  has  no  transient  states  (states  that 
can  not  be  reached  from  other  sets  of  states).  If  fact  it  is  exactly  equivalent  to  the 
underlying  graph  G  being  strongly  connected  (for  every  a,b  ^  V  there  exists  a 
directed  path  of  edges  in  E  from  a  to  b).  This  condition  is  stronger  than  saying 
that  the  matrix  P  is  an  irreducible  matrix  but  is  equivalent  to  P  1^1  >0. 


10 


An  irreducible  Markov  chain  is  stationary  or  aperiodic  if  Va;,y  e 
V  P[Xt  =  y\Xo  =  x]  exists.  This  means  that  for  any  k  Xt  and 

Xt^k  are  distributed  identically  as  t  — >■  oo.  It  is  easy  to  see  that  for  an  irreducible 
aperiodic  Markov  chain  the  above  limits  must  independent  of  x  and  it  makes  sense 
to  write  lim^-^oo  P[Xt  =  y]-  Xn  irreducible  Markov  chain  is  aperiodic  if  and  only 
if  the  least  common  divisor  of  the  lengths  of  all  cycles  in  G  (counting  self  loops  as 
cycles  of  length  1)  is  equal  to  1.  The  linear  operator  interpretation  of  aperiodic  is 
that  all  eigenvalues  of  the  operator  are  real  (and  >  —1).  If  the  chain  is  irreducible 
and  aperiodic  then  the  eigenvalue  1  occurs  with  multiplicity  exactly  one  and  the 
unique  eigenvector  tt  such  that  1  •  tt  =  1  corresponding  to  this  eigenvalue  is  called 
the  stationary  distribution.  We  then  have  Vy  €  V  tTj,  =  lim^^oo  P[Xt  =  y]  and 
TTj,  >  0  for  all  y  €  V. 

An  irreducible  aperiodic  Markov  chain  is  time  reversible  if  V(a;,y)  € 
E  'K{x)P{x,y)  =  7r{y)P(y,x).  Time  reversibility  has  a  number  of  interpre¬ 
tations.  The  most  obvious  is  that  for  every  edge  {x,  y)  G  E  we  have  (y,  x)  G  E 
and  limt_^oo  P[Xt  =  x  A  Xt+\  =  y]  =  lini<^.oo  P[Xt  =  y  A  Xt+i  =  x],  or  each 
edge  direction  is  equally  likely  in  the  limit.  In  terms  of  linear  operators  if  a  Markov 
chain  is  irreducible,  aperiodic  and  time  reversible  then  the  matrix  DPD~^  is  sym¬ 
metric  where  Z)  is  the  |y  |  x  |y  |  diagonal  matrix  such  that  Thus 

P  is  in  fact  similar  to  a  diagonal  matrix.® 

From  now  on  we  will  assume  our  Markov  chain  is  irreducible,  aperiodic  and 
time  reversible.  We  have  for  such  chains  if  q  is  any  vector  that  obeys  the  time 
reversal  equation  Vrc,  y  gV  P{x,  y)qx  =  P(y,  x)qy  then  q  =  Xti  for  some  scalar 
A.  This  property  is  very  useful  in  designing  Markov  chains  with  specific  stationary 
distributions. 

If  G  is  a  strongly  connected  symmetric  graph  ((a;,  y)  G  E  {y,x)  G  E)  there 
is  a  very  powerful  general  method  of  designing  a  Markov  chain  by  choosing  the 
transition  probabilities  according  to  the  “Metropolis  filter.”  Suppose  we  have  a 
positive  function  F  and  we  wish  that  the  stationary  distribution  tt  of  our  Markov 

®It  is  interesting  to  notice  that  time  reversibility  taken  alone  does  not  have  such  a  simple 
interpretation  as  the  3  state  Markov  chain  with  transition  matrix 

r  i  i  0  1 

2  2  Y 

p  —  0  1  i 

r  —  V  2  2 

0  0  1 

is  not  similar  to  any  symmetric  or  diagonal  matrix. 
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chain  has  'Kx  =  F{x)liJ2yev  F{y))  €  V.  For  a;  €  V  let  N{x)  be  the  set  of  all 

y  y  /  (^?y)  ^  F.  Let  Y  be  maxi^gy  |^(a^)|,  the  largest  degree  of  any 

vertex  in  G  and  define 


P{x,y)  =  { 


2T  F(f)) 

1  -  EzeN(x)  z) 

0 


y  G  N{x) 
x  =  y 
otherwise 


(1.1) 


The  Markov  chain  with  transition  matrix  P  is  irreducible,  aperiodic  and  time- 
reversible.  The  irreduciblity  come  directly  from  G.  Aperiodicity  because 
P{x,x)  >  Oforalla;.  The  chain  is  time-revisable  because  if  x  /  yandP(x,y)  ^  0 
then  it  time-reverses  with  F: 

P(x,y)F(x)  =  imin(l,fg)F(x)  = 

^mm{F{x),F{y))  = 

^min(l,fg)P(y)  =  P{y,x)F{y). 

Furthermore  this  is  enough  to  show  that  the  unique  stationary  distribution  n  is  such 
that  TTx  =  F{x) / (EyGV  Fiv))  ^  It  Is  very  important  that  F  enters  into  the 
transition  probability  equation  as  a  ratio-  so  one  never  actually  has  to  pre-compute 
the  normalizing  factor  J2y^v  F{y)  as  this  factor  is  often  the  count  or  volume  that 
one  needs  the  Markov  chain  to  estimate. 

We  want  our  Markov  chain  to  be  rapidly  mixing.{3^  To  explain  this  property 
we  pick  a  measure  of  similarity  between  distributions  on  V.  Let  pA  and  ps  be 
two  probability  distributions  on  V.  The  distance  from  pA  to  pb  can  be  defined  a 
number  of  ways: 

{It  j/gv  Ipa(j/)  Ps(y)|^  /^distance 
maxj,€v  \pA{y)  -  PB{y) I  loo  distance 

chi-sq  distance  with  respect  to  pA . 

We  use  l\  or  variational  distance  because  it  is  the  distance  used  in  many  of  the 
papers  in  the  literature  and  has  a  slightly  simpler  interpretation.  Let  d(pA,PB) 
denote  one  of  the  above  distance  measures.  Let  pt  be  the  distribution  at  time  time 
({pt)y  P[Xt  =  y])  it  is  well  known  that  for  any  Markov  chain  as  above  that  there 
is  a  constant  x  <  1  such  that  d{7r,pt)  <  x‘/(7r,  po)  where  /  is  some  ftmction  of  tt 
and  po-  For  the  chi-sq  distance  we  can  take  /(tt,  po)  =  T,yev  (which  is 
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again  the  chi-sq  distance,  simplifying  subsequent  analysis). [27]  A  Markov  chain 
is  rapidly  mixing  if  1/(1  —  x)  is  smaller  than  some  polynomial  in  parameters  we 
care  about. 

We  will  be  looking  at  chains  whose  states  are  lattices  in  R"  intersected  with 
bodies  of  diameter  d  whose  transitions  correspond  “King’s  moves.”  In  this  case 
rapidly  mixing  will  mean  that  1  /  ( 1  —  x)  is  bounded  above  by  a  polynomial  in  n  and 
d.  It  would  be  nice  if  rapidly  mixing  had  a  simple  definition  such  as  “polynomial 
in  log(|V|)”  but  it  is  easy  to  show  that  Markov  chain  on  the  points  in  a;  G 
with  Xi  <  d  has  d"  states  and  has  1/(1  -  x)  ^  0{nS)  which  is  not  polynomial 
in  log(|y  I)  =  n  log  d.  We  wish  to  consider  this  chain  rapidly  mixing,  so  we  are 
forced  to  abandon  any  such  simple  definition  of  rapidly  mixing.  It  is  important 
to  notice  that  the  graph-diameter  of  such  a  Markov  chain  is  0{nd)  thus  our  chain 
is  rapidly  mixing  in  the  sense  it  mixes  in  time  polynomial  in  the  diameter  of  the 
chain  (instead  of  the  much  weaker  property  of  mixing  in  time  polynomial  in  the 
number  of  states). 


1.4  Some  geometric  Markov  chains. 

Here  we  quickly  outline  some  of  the  typical  Markov  chains  used  to  generate  points 
from  convex  regions.  All  the  Markov  chains  in  this  section  are  designed  by  picking 
a  finite  set  of  points  from  inside  a  convex  region  as  the  set  of  states.  The  underlying 
graph  G  =  (V,  jB)  of  the  Markov  chain  is  then  chosen  as  above  and  such  that  for 
each  state  it  is  easy  to  generate  a  neighbor  with  probability  P{x,  y). 

1.4.1  King’s  moves 

The  idea  of  “King’s  moves”  (used  in  both  the  optimization  problem  and  for 
contingency  tables)  is  to  have  the  set  of  states  of  the  Markov  chain  be  some  affine 
translate  of  the  standard  integer  lattice.  The  edges  in  this  Markov  chain  are  the 
states  that  differ  by  one  unit  in  exactly  one  coordinate.  These  are  not  the  moves 
that  the  King  makes  in  chess  (which  would  allow  a  unit  difference  in  any  number 
of  coordinates)  but  for  lack  of  a  better  name  have  been  called  “King’s  moves”.  In 
this  scheme  each  state  in  the  Markov  chain  is  identified  with  the  cube  consisting  of 
points  closest  to  the  state.  States  roughly  correspond  to  volume  elements  and  edges 
correspond  to  facets  of  cubes  (or  area  elements).  This  correspondence  of  states 
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and  transitions  to  simple  geometric  objects  is  of  central  importance  in  analyzing 
the  behavior  of  these  Markov  chains. 

1.4.2  Ball  moves 

Ball  moves,  and  the  related  “Normal  moves”,  also  have  a  strong  geometric  inter¬ 
pretation.  In  ball  moves  each  point  in  R”  is  a  state  in  the  Markov  chain  (actually 
one  is  forced  to  discretize  space  on  a  very  fine  grid,  but  for  all  intents  and  purposes 
one  deals  with  a  Markov  chain  on  R”)  and  transitions  are  vectors  drawn  uniformly 
from  a  ball  (or  in  the  normal  case  vectors  drawn  from  R”  using  the  normal  density 
function).  A  transition  corresponds  to  adding  a  vector  to  the  current  state  to  get  a 
new  state.  This  is  very  much  like  a  discrete  time  Brownian  process.  In  this  treat¬ 
ment  one  rarely  discusses  individual  states  but  instead  reasons  about,  measurable, 
collections  of  states.  The  probability  of  moving  from  a  given  state  into  a  set  of 
states  is  then  the  proportion  of  the  measure  of  the  step  set  (ball  or  normal)  that  is 
inside  the  destination  set. 

1.4.3  Gibbs  sampler 

The  Gibbs  sampler  is  a  classic  method  used  in  statistics.  What  one  does  is  either 
run  through  coordinate  directions  in  a  cyclic  order  or  pick  them  at  random.  When 
one  has  coordinate  direction  one  then  generates  a  point  on  the  line  passing  through 
the  current  point  in  the  given  coordinate  direction.  The  idea  behind  this  method  is 
to  mix  perfectly  with  respect  to  one  variable  at  a  time.  The  hope  is  that  in  relatively 
few  nms  through  the  variables  that  the  process  will  mix.  Gibbs  has  the  advantage 
that  the  moves,  which  seem  quite  powerful,  are  often  quite  easy  to  implement. 
This  method  has  not  been  well  characterized  as  a  Markov  chain.  Most  proofs  of 
mixing  time  prove  only  that  Gibbs  is  not  much  worse  than  King’s  moves.  These 
moves  were  also  called  “Rooks  moves”  in  Applegate’s  thesis. 

1.4.4  Hit  and  run 

Hit  and  run  is  the  neune  of  an  important  class  of  stochastic  techniques.  [5 5, 56, 48, 9] 
Hit  and  run  is  used  in  non-linear  programing  and  it  is  of  interest  to  this  thesis  that 
it  was  recently  shown  to  be  rapidly  mixing  by  Martin  Dyer  and  Leen  Stougie.[25] 
Then  method  is  natural  but  has  the  weakness  that  no  one  has  identified  a  simple 
geometric  quantity  that  predicts  the  mixing  rate  like  isoperimetry  does  for  King’s 
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or  ball  moves.  In  Hit  and  Run  one  picks  a  direction  r  uniformly  from  the  surface 
of  the  unit  ball  and  then  in  one  step  generates  a  point  distributed  according  to  F 
(the  density  function  we  are  trying  to  achieve)  restricted  to  the  ray  emanating  in 
the  direction  of  the  ray.  It  is  interesting  to  note  that  this  chain  is  time  reversible 
without  any  application  of  the  Metropolis  filter,  just  by  the  fact  that  a  ray  r  and  — r 
are  generated  with  equal  probability. 

1.5  Counting/computing  volume. 

If  one  can  count  lattice  points  one  can  compute  volumes  and  the  converse  is  often 
true.  The  following  is  taken  from  Gruber  and  Lekkerkerker  [34]. 

A  lattice  polytope  is  a  polytope  with  all  integral  vertices.  For  a  lattice  polytope 
P  and  natural  number  k  we  have 


^{Z-nkP)  =  ^Li{P)F 

i=0 

where  the  Li  are  rational  numbers  depending  only  on  P  and  n.  Let  /  be  any 
function  from  the  set  of  all  lattice  polytopes.  We  say  that  /  is  unimodular  invariant 
if  for  any  lattice  polytope  P  and  unimodular  integral  transformation  U  and  integral 
vector  u‘. 

f{UPFu)  =  f{P). 

We  say  that  /  is  additive  if  for  all  lattice  polytopes  Pi ,  P2  such  that  P\  U  Pj  is 
again  a  lattice  polytope  then 

/(P,  u  P2)  +  /(Pi  n  P2)  =  /(Pi)  +  /(P2). 

A  theorem  of  Betke’s  shows  if  /  is  unimodular  invariant  and  additive  (volume  is 
such  a  function)  then  /(P)  =  Ya-q  \iLi{P)  for  all  P  where  L,-  are  as  above  and 
\i  are  scalars  depending  only  on  /  (independent  of  P).  Thus  the  ability  to  count 
lattice  points  is  in  some  sense  universal  for  a  large  class  of  invariants  of  integral 
polytopes. 

The  volume  of  a  convex  body  differs  from  the  number  of  lattice  points  in  the 
convex  body  by  an  amount  proportional  to  the  number  of  lattice  points  whose 
Vomoi  cells  are  not  completely  covered  by  the  body  or  its  compliment.  Often,  as 
is  the  case  with  contingency  tables,  the  body  contains  a  reasonable  sized  copy  of  a 
n-cube  and  therefore  the  volume  is  a  good  approximation  of  the  number  of  lattice 
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points  in  the  body.  This  allows  one  to  generate  a  lattice  point  from  a  distribution 
arbitrarily  close  to  uniform  (by  a  rejection  sampling  technique)  and  then  generate 
estimates  for  the  number  of  lattice  points  with  any  desired  level  of  accuracy. 
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Chapter  2 

Up  Monotone  Optimization 


2.1  Membership  model. 

We  are  interested  in  minimizing  a  linear  function,  c,  over  a  convex  body  K  C  R" 
presented  by  a  membership  oracle.  This  is  the  most  general  form  of  a  linear 
convex  optimization  problem.  Under  the  “oracle”  model  [33]  the  convex  body  K 
is  presented  as  a  tape  containing  (n,  a;,  r,  R)  where  n  is  dimension  of  space  we  are 
working  in,  «  G  Q”  is  a  rational  point  in  K,r  >  0  is  such  that  Br{x)  C  K  and  R 
is  such  that  K  C  Br{x),  where  Br{x)  denotes  the  ball  of  radius  r  centered  at  x. 
It  is  also  assumed  that  one  can  determine  if  y  G  /T  for  any  y  G  R".  In  this  model 
the  only  optimization  algorithms  known  to  run  in  polynomial  time  are  relatives  of 
the  classic  ellipsoid  method. 

A  large  problem  with  these  method  for  convex  optimization  are  that  separating 
inequalities  are  required  to  run  the  algorithm.  Given  a  point  a;  G  R”,  x  ^  K  a. 
separating  inequality  for  K  and  a;  is  a  vector  a  and  number  b  such  that  a  •  x  >  b 
and  '^y  e  K  a  ■y  <b.  The  fact  that  any  two  closed  disjoint  convex  sets  can  be 
separated  by  such  a  linear  relation  is  one  of  the  central  properties  of  convexity. 
Separating  relations  can  be  derived  from  a  membership  oracle  [33,  43]  but  the 
reduction  is  complicated  and  expensive. 


2.2  Algorithm  outline  and  time  bounds. 

Using  the  membership  oracle,  we  approximately  minimize  a  linear  function  over 
an  up-monotone  convex  feasible  set  in  the  positive  orthant  as  follows.  We  may 
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assume  a  suitable  upper  bound  on  the  variables  so  we  can  enclose  this  feasible 
region  in  a  rectangle  (in  n  dimensions).  At  the  heart  of  our  approach  is  a  positive 
real- valued  logarithmically  concave  function  F  on  the  rectangle  with  the  following 
properties:  (1)  the  integral  of  F  over  a  region  consisting  of  near-optimal  solutions 
is  at  least  a  constant  fraction  of  the  integral  of  F  over  the  whole  feasible  set,  and  (2) 
the  integral  of  F  over  the  feasible  set  is  at  least  a  constant  fraction  of  the  integral 
of  F  over  the  entire  rectangle.  Thus,  if  we  pick  a  “random  sample”  from  the 
rectangle  with  probability  density  proportional  to  F  (we  refer  to  this  throughout  as 
“sampling  according  to  F”),  we  would  get  a  near-optimal  solution  with  constant 
probability;  this  probability  can  be  boosted  by  repeated  sampling.  Our  algorithm, 
then,  is  simply  a  choice  of  F  (determined  by  two  parameters  a,  a  damping  factor 
favoring  feasible  points,  and  (3,  a  bias  favoring  points  with  better  objective  values) 
and  a  method  to  obtain  a  sample  according  to  F.  We  show  that  a  certain  biased 
random  walk  (on  the  uniform  grid  of  size  5,  to  be  determined  later),  starting  from 
a  feasible  solution  {x^),  is  indeed  able  to  pick  a  random  sample  from  the  feasible 
set  with  probability  (approximately)  proportional  to  F.  While  it  is  relatively  easy 
to  argue  that  in  the  steady  state,  this  random  walk  picks  a  sample  with  density 
proportional  to  F,  it  is  nontrivial  to  show  that  this  steady  state  is  approached  in 
a  polynomial  number  of  steps.  To  accomplish  this  central  result,  we  draw  on 
recently  developed  results  in  the  theory  of  rapidly  mixing  Markov  Chains  as  well 
as  on  random  walks  in  convex  sets  [23] ,  [5].  The  latter  paper  gives  a  technique  for 
sampling  from  log-concave  distributions  which  we  use  here,  although,  we  have 
tried  to  make  this  description  self-contained  by  giving  as  many  details  as  possible. 
Our  random  walk  can  be  executed  with  only  local  knowledge  of  F  as  well  as  a 
membership  (not  a  separation)  oracle  for  the  feasible  set. 

Given  an  instance  of  the  problem  (a  membership  oracle  for  K  and  objective 
function  c  >  0),  e  >  0  (relative  error),  «  >  0  (failure  probability)  and  an  initial 
feasible  point  ,  the  algorithm  succeeds  with  probability  at  least  1  —  k  in  finding 
a  €  R”  which  is  feasible  and  such  that  c  •  <  (1  +  e)(c  •  Rather  than 

come  within  e  of  the  optimal  in  one  long  random  walk,  we  develop  an  adaptive 
algorithm  which  improves  the  feasible  solution  in  stages  (by  iteratively  refining 
the  gap  between  a  known  feasible  solution  and  a  probabilistic  point-wise  lower 
bound  lower  bound  L  on  optimal  cost).  This  “staging”  of  the  algorithm  decreases 
the  run-time’s  dependence  on  e. 

Each  stage  begins  with  a  feasible  solution  x^  and  a  probabilistic  “lower  bound” 
L  where  if  there  is  a  feasible  x  with  c  -  x  <  L,  then  the  algorithm  has  failed.  (We 
will  of  course  ensure  that  the  probability  of  failure  is  low.)  We  refer  to  c  •  x^  —  L 
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as  the  “gap”  (at  the  beginning  of  the  stage).  At  the  end  of  the  stage,  we  have  a 
new  lower  bound  and  a  new  feasible  solution;  we  ensure  that  the  gap  at  the  end  of 
a  stage  is  at  most  1/2  of  the  gap  at  the  beginning.  We  stop  when  the  gap  is  at  most 
e  times  the  value  of  the  cost  lower  bound.  The  algorithm  is  described  in  detail  in 
figure  2.4  and  justified  in  section  2.7,  but  we  give  here  a  short  verbal  description. 

Each  stage  proceeds  as  follows:  the  feasible  set  is  enclosed  in  a  rectangular 
solid.  We  devise  a  log-concave  function  F  on  the  rectangle  with  the  two  properties 
described  above.  [The  function  F  has  two  components:  a  “penalty”  (called  the 
gauge  function  in  what  follows)  for  going  out  of  the  feasible  set  which  increases 
as  we  become  “more  infeasible”  and  a  bias  (drift)  which  favors  low  objective 
fimction  value.  In  some  vague  sense,  this  is  similar  to  Lagrangian  relaxation  with 
both  feasibility  and  optimality  represented  by  one  function.] 

We  then  discretize  by  dividing  the  rectangle  into  small  cubes.  We  perform 
a  random  walk  on  the  cubes  with  transition  probabilities  depending  on  F.  It 
will  be  easy  to  see  that  the  steady  state  probabilities  of  this  random  walk  will  be 
proportional  to  F.  We  will  also  show  fast  convergence  to  the  steady  state,  so  that 
after  a  polynomial  number  of  steps,  we  are  “close”  to  the  steady  state  probabilities. 

After  doing  the  random  walk  for  this  number  of  steps,  one  of  the  following  two 
scenarios  occurs: 

(i)  We  have  found  a  feasible  solution  whose  value  cuts  down  the  gap  by  a  factor 
of  at  least  1/2.  In  this  case,  we  replace  our  old  feasible  solution  by  this  and  go  to 
the  next  stage. 

(ii)  Otherwise,  we  have  (probabilistic)  proof  of  a  greater  lower  bound  and  we 
go  to  the  next  stage  with  this  new  lower  bound  (again  cutting  the  gap  down  by  a 
factor  of  1/2). 

Although  F  has  been  devised  accurately  to  have  the  desired  properties,  several 
errors  are  introduced  in  the  sampling  procedure.  There  are  errors  due  to  discretizing 
into  small  cubes,  due  to  the  inexact  computation  of  the  gauge  function  and  due  to 
the  fact  that  the  lower  bounds  are  only  probabilistic.  The  management  of  these 
errors  is  the  main  focus  of  section  2.5. 

Our  main  result  is  two  bounds  on  the  miming  time  of  the  algorithm.  [The 
running  time  is  bounded  above  by  the  minimum  of  the  two.] 

If  p  is  the  ratio  of  the  value  of  the  given  initial  feasible  solution  to  the  optimal 
value,  e  is  the  required  relative  error,  1  —  k  the  required  success  probability  and  n. 
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the  dimension  of  the  up-monotone  convex  feasible  set  K,  the  first  boimd  is 


( Qog  (7))"^og  (^)  (7)  \ 


If  in  addition,  we  are  given  >  0  such  that  Vy  6  K,  we  have  <  y  and  an 
a;“  such  that  there  is  an  optimal  solution  z  with  z  <  x^  (so  we  can  replace  K  by 
K  C\{x  \  x  <  a:“}),  then  we  also  have  a  bound 


0 


|a:“  —  x^ 


emiUi  x\ 


x'^  —  x^ 
eminia;| 


The  rest  of  the  chapter  is  organized  as  follows.  Section  2.3  notes  that  the 
CC  problem  falls  into  the  framework  of  up-monotone  convex  sets  and  has  some 
general  remarks.  Section  2.4  constructs  the  appropriate  log  concave  and  gauge 
functions  that  are  to  be  used  in  each  stage  of  the  algorithm.  Section  2.5  describes 
the  random  walk  to  be  performed  at  any  stage  of  the  adaptive  algorithm,  and 
contains  the  analysis  of  the  errors  introduced  due  to  discretization,  gauge  function 
approximation  and  walking  for  finite  number  of  steps.  In  section  2.6,  we  find  a 
bound  on  the  spectral  gap  of  the  Markov  chain,  which  allows  us  to  use  results  on 
rapid  mixing  to  provide  a  bound  on  the  number  of  steps  required  in  any  stage. 
In  section  2.7,  we  prove  the  adaptive  algorithm’s  correctness  and  run  time,  and 
present  two  variants  and  prove  their  run  times.  Proofs  for  certain  lemmas  have 
been  deferred  until  the  end  of  the  chapter. 


2.3  The  component  commonality  problem. 

The  problem  was  described  in  detail  in  the  Introduction.  If  U  denotes  the  matrix  of 
theujj ’s  there,  let  y  =  y{d)  =  Ud.  Under  the  assumption  that /i(-)  is  log-concave, 
it  is  easy  to  see  that  y  has  a  log-concave  density.  Let  D  denote  the  density  of 
the  y.  Let  hd  denote  the  corresponding  measure.  (So  for  any  measurable  set  S, 
fj.D{S)  =  fg  D.)  Then  the  feasible  set  K  of  stock-levels  can  be  expressed  as 

/T  =  {a;  €  R”  :  //D(dom(a;))  >  7} 

where  dom(a;)  is  {y  :  y  <  x}.  It  is  easy  to  show  that  K  is  convex  ([54]).  It  is  also 
clearly  up-monotone. 


20 


2.3.1  Why  component  commonality  is  convex. 

It  is  clear  that  component  commonality,  as  described  in  the  introduction,  is  an 
up-monotone  problem  contained  in  the  positive  orthant.  It  is  also  easy  in  practice 
to  find  an  initial  feasible  stocking.  So  if  the  feasible  set  is  convex  then  it  fits  into 
our  monotone  optimization  scheme.  Let  K  is  the  feasible  region  of  a  component 
commonality  problem  its  convexity  follows  from  a  theorem  of  Prekopa  [54]  but 
it  is  worthwhile  to  see  how  this  is  proven  directly  from  the  Bruim-Minkowski 
theorem. [10]  The  Brunn-Minkowski  theorem  is:  if  Vol  is  the  standard  volume 
function  and  A,  5  are  two  compact  convex  sets  in  R”  then  the  function 

/  :  [0, 11  ->  R  :  /(A)  =  Vol(AA  +  (1  -  A)B)'/” 

is  concave.  [10]  If  our  density  F  was  the  indicator  function  of  a  convex  set  B 
(1  if  a;  €  5,  0  otherwise)  then  the  component  commonality  problem  would  be 
convex  because  then  the  probability  that  a  stocking  x  is  sufficient  would  be  ex¬ 
actly  Vol({d  €  R”^'*’  \  Ad  <x})l  Yol{B)  and  we  would  have  for  any  two  feasible 
stockings  x,  y  and  A  €  [0,1] 

Vol({d  G  R”^+  \  Ad  <Xx  +  {\-  A)y 

>  Vol(A{deR’”+  \  Ad  <  x}  +  {I  -  X){d  eBF+  \Ad<y}y/"^ 

>  XWol{{d  e  R’^^  I  Ad  <  x}yi^  +  {1-  X)Yol{{d  e  \  Ad  < 

>  min  (Vol({d  G  R”^+  |  Ad  <  x}y^^,Yo\{{d  G  R”^+  |  Ad  <  j/ })*/’")  . 

Thus  Ax  +  (1  —  X)y  is  feasible  and  the  feasible  region  must  be  convex 
(since  x,y  were  arbitrary).  The  first  inequality  is  because  the  Minkowski 
sum  of  {d  G  R’”'*'  I  Ad  <  x}  and  {d  G  R”*'*'  |  Ad  <  y}  is  contained  in  the 
set  {x  G  R*”"^  I  Ad  <  Ax  +  (1  —  A)j/}.  The  second  inequality  is  the  Brunn- 
Minkowski  theorem  and  the  third  is  an  obvious  fact  about  concave  functions. 

To  extend  the  argument  to  more  general  F  we  define: 

A<,(x)  =  {de  R’^^  I  Ad  <  X,  F{d)  >  a} , 

the  set  of  points  with  density  >  a  that  do  not  exceed  the  stocking  vector  x.  Again 
by  simple  properties  of  the  Minkowski  sum  we  have  for  A  G  [0,1] 

Aa  (Ax  +  (1  -  X)y)  D  AAa(x)  -f  (1  -  X)Aa(y). 
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And,  by  Brann-Minkowski, 


'I 


(AAa(ic)  +  (l“A)Aa(l/)) 


1  > 


SO 


x([  lY+{l-X)(l  l)" 

\JAa{x)  )  \JAa{y)  J 

f  l>min(/  1,/ 

JAo{\x+{l-\)y)  \JAa(x)  JAaiy) 


Using  that  for  any  2; 


/  Fid)=  r  [  Ida 

J{d\Ad<z}  Ja—oJAa(z) 

we  can  integrate  over  levels  of  the  density  F  and  get 

[  G  >  min  ( I  gJ  g) 

JAo(x+{l-\)y)  \J^{x)  JAc(y)  / 

thus  if  x,y  €  K  then  (A®  +  (1  —  X)y)  €  K  and  K  is  convex. 


2.3.2  Hardness  of  discrete  version  of  component  commonality 

For  m  points  j/^, . . .  j/*”  in  R”  and  x  also  in  R”  let  dom(a;)  be  the  set 
{i  \  i  G  I  •  •  -  m,  X  >  y^}. 

Theorem  1  Given  m  points  y* ,  . . .  y'^  in  a  fraction  7,  a  positive  vector  c 
and  a  real  number  C,  deciding  if  there  exist  an  a:  6  R"  such  that  a;  >  y’  is  satisfied 
for  at  least  ■ym  different  i ’s  and  c  -  x  <(  is  NP  complete. 

Proof  sketch:  Let  G  =  (V,  be  an  undirected  graph  with  vertices  V  = 
{ 1 , 2  •  •  •  n}  and  edges  E.  Given  an  integer  k,  the  clique  problem  is;  “does  G  have 
a  complete  induced  subgraph  on  some  k  vertices?”  For  each  e  G  E  let  y^  G  R” 
be  the  vector  such  that 

e  _  J  1  vertex  i  is  an  endpoint  of  edge  e 
~  I  0  otherwise 
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Let  c  =  1, 7  =  (A:  -  1)A:/(2|£:|)  and  C  =  A;. 

Let  a;  be  a  solution  to  the  above  problem.  WLOG  assume  a:  is  a  0- 
1  vector  and  define  Gx  =  (14,£^a;)  to  be  the  induced  subgraph  of  G  where 

=  {*  €  V  I  3;,  >  1 }.  We  then  have  \Gx\  =  c  -  x  and  \Ex\  =  \  dom(a;)|.  So  we 
see  that  x  such  that  c-x  <k  and  |  dom(a:)  |  >  fc(A;  —  1  )/2  correspond  precisely  to 
induced  subgraphs  of  G  with  <  k  vertices  and  >  k{k  —  1)12  edges:  k-cliques.  □ 

This  result  compliments  the  recent  result  of  Martin  Dyer  and  Leen  Stougie[25] 
which  shows  that  two  stage  stochastic  programming  with  simple  recourse  is  jjP 
complete. 


2.3.3  General  remarks 

While  we  do  assume  that  a  membership  oracle  is  available  for  K,  we  do  not  assume 
that  a  separation  oracle  is  available.  By  a  theorem  of  Yudin  and  Nemirovskiii  it  is 
known  that  membership  and  separation  are  polynomial  time  equivalent  for  convex 
programming  problems  like  this  one  [33] .  But  the  conversion  has  a  large  exponent, 
and  the  approach  here  is  more  efficient. 

Instead  of  computing  the  integral  each  time  the  membership  oracle  is  called, 
we  could  instead  pick  m  samples  t/^, according  to  D  at  the  outset  and 
then  for  each  query  x  to  the  membership  oracle  for  K,  say  yes  to  the  query  if 
and  only  if  for  at  least  7m  of  the  y*,  we  have  x  >  y\  It  is  easy  to  show  by 
standard  techniques  that  for  m  large  enough,  this  suffices  as  a  an  approximate 
test  of  membership  in  K.  We  do  not  go  into  the  details  here.  Also,  in  the  actual 
situation,  either  the  y*  may  be  available  from  past  data  or  if  one  hypothesizes  a 
particular  log-concave  density  D,  we  may  draw  samples  according  to  the  density 
using  the  techniques  of  [5]. 

One  may  be  tempted  to  solve  the  discrete  problem:  given  m  points  y  * ,  , . . .  y™ 

in  R”,  a  fraction  7,  a  positive  vector  c  and  a  real  number  (,  does  there  exist  an 
a;  €  R"  such  that  a;  >  y*  is  satisfied  for  at  least  7m  different  i ’s  and  we  also 
have  c  •  a;  <  C  ?  ffi  this  generality,  we  show  (in  the  appendix)  that  the  problem  is 
NP-hard.  So  one  needs  to  exploit  the  special  nature  of  K,  namely  its  convexity. 

For  an  NLP  approach  to  the  CC  problem,  see  [37]  and  references  provided 
there. 
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2.4  The  function  F\  bias  and  damping. 


Let  K  be  an  up-monotone  convex  set  contained  in  the  positive  orthant  of 
R",  a  known  feasible  point  and  c  >  0  the  cost  vector  in  the  linear 
objective  function.  For  real  numbers  L  >  0,  T  >  0,  we  define  a  log- 
concave  distribution  R(l,t)  on  R"  such  that  if  L  <  mf{c-y  \y  £  K}  then 
{{y^K  \c-y  <  inf  {c-y  \  yeK}  +  T})/(Xb^l,t)  >  J  and  show 

how  to  use  the  membership  oracle  for  K  to  approximately  sample  fi:om  this  dis¬ 
tribution  in  an  efficient  manner. 

Let  x‘  be  such  that  ^y  E  K,  x^  <  y  and  let  be  such  that  Vy  G  it',  3z  £  K 
such  that  c  ‘  z  <  c  ‘  y  and  z  <  x^.  Note  that  by  the  up-monotone  property, 
e  K.  If  x^  and  are  not  explicitly  given  we  can  take  x^  =  0  and  a;“  such 

that  xf  =  (1/cj)  Cix{.  Let  Kl  =  K  0  {y  \c-y>L}  (we  will  later  further 

restrict  our  attention  to  a  “rectangle”  that  is  roughly  ja;  |  ®  <  «“  |). 

Definition  1  For  any  real  L  >  0,  T  >  0  let  “the  tip”  be  the  set  x  £  Kl  such  that 
c  •  X  <  inf  {c  •  y  \  y  £  Kl}  +  T. 

Let  denote  the  infimum  of  all  real  positive  numbers  A  such  that 

x“  +  £  Kl-  This  is  the  dilation  of  Kl  about  x“  needed  to  contain  x  and  is 

called  the  gauge  function  associated  with  Kb-  When  the  context  is  clear  we  will 
suppress  the  subscripts  and  use  f  instead  of 
F  will  be  of  the  form 

F(^x)  —  (2.1) 

where  f  and  a  are  positive  reals  to  be  determined  later.  We  take  F(x)  as 
the  unnormalized  density  function  for  a  log-concave  distribution  on 

{x  G  R”  I  X  <  x“}.  Note  that  is  the  bias  (in  favoring  better  objective 

values),  and  is  the  function  to  damp  this  bias  in  regions  that 

are  infeasible. 

The  selection  of  a  and  ^  is  done  in  two  stages:  (1)  we  fiirst  find  /?  such  that  at 
least  half  of  the  probability  mass  in  the  body  (according  to  the  density  R(l,t))  is  in 
“the  tip”,  and  then  (2)  find  a  such  that  at  least  half  the  mass  in  the  entire  rectangle 
is  in  the  body. 
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2.4.1  Getting  in  the  tip 

For  X  in  Kl,  the  distribution  F  is  a  function  of  the  c  •  a:  only  (since  ^(a;)  <  1). 
Lemma  1  IfL  >  0,T  >  0  and  jS  >  ^,  we  have 


Proof:  Let  c*  =  inf  {c  •  y  ]  y  G  Kl},  ^  G  Kl  such  that  c  -  z  =  c*  and  A*  be  the 
intersection  of  the  hyperplane  c-x  —  c*  +  T  and  K.  Clearly,  the  convexity  of  Kl 
implies  that  !KLr\{x-.c-x<c*+T}  ^  not  increased  if  we  replace  Kl  H  {x  c  •  x  < 
c*  +  r}  by  the  convex  hull  of  z  and  A*.  Also,  replacing  Kl  C\  {x  :  c  •  x  >  c*  +  T} 
by  the  truncated  cone  formed  by  intersecting  {x  :  c-x  >  c*  +  T}  with  the  minimal 
pointed  cone  with  vertex  2;  containing  A*  cannot  decrease  /  F  over  this  set.  So  it 
suffices  to  prove  the  lemma  with  Kl  equal  to  the  infinite  cone.  Then  the  ratio  of 
integrals  in  the  lemma  is 

(y)”  ^  area(A*)e-^^dA  _  e-^^d  A 

X?  area(A*)e“^^d  A  ic*  e  ^^dA 

We  change  variables  (and  consult  standard  integral  tables)  to  get: 

rA»->e-/»dA  k)  ■ 

1  ^  1 

By  picking  /?  =  ^  the  ratio  is  1  —  e“"  YlkZo  ff’  which  is  >  ^  forn  >  1,  so  any 

T) 

0>j;,  (2.2) 

will  do.  □ 

2.4.2  Staying  in  the  body. 

We  now  show  that  at  least  }  of  the  mass  is  in  the  body  Kl,  which  would  imply 
that  at  least  }  of  the  mass  of  B(^l,t)  is  in  the  tip  (the  near  optimal  feasible  region), 
for  a  suitable  choice  of  a. 
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Figure  2.2:  The  body  versus  a  cone. 
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Figure  2.3:  An  example  “thin”  cone. 
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Let  dKi  be  the  set  of  y  in  the  boundary  of  Kl-  For  any  C  >  0  all  of  Kl  can 
be  covered  by  a  collection  C  of  disjoint  cones  such  that  for  each  cone  r  E:  C,  we 
have  ||a;  —  y\\2  <  (  for  x,y  r  f]  dKi. 

Now,  if  less  than  half  of  the  mass  according  to  is  in  Ki  then  there  must 

be  a  cone  r  E  C  such  that  less  than  half  the  mass  according  to  restricted  to 
r  is  in  r  n  Ki.  By  the  arbitrary  nature  of  (  we  see  the  same  must  hold  for  some 
“infinitely”  thin  cone.  Chose  such  a  cone  and  let  y  be  the  point  at  which  the  cone 
intersects  dK  and  set  Aq  =  ||a;“  -  y||2. 


Lemma  2  Suppose 

a  >  max  ^3/?(c  •  —  L)  +  3n  —  5,  n(e^  +  1)  +  1^  (2.3) 

Then  the  mass  of  any  infinitesimal  cone  outside  of  Kl  can  be  shown  to  be  no  more 
than  the  mass  of  the  same  cone  inside  Kl  ,  thus  yielding  a  ratio  of  feasible  to  total 
of  at  least 


Proof:  For  the  proof  of  this  lemma  only,  it  will  be  convenient  to  multiply 
masses  by  so  for  any  set  S,  we  mean  by  the  mass  of  S,  the  quantity 

g/JcaT**  p_  jj^g  mass  of  the  cone  outside  Kl  is  given  by: 

r  dA 

JXo 

fOO 

/CO 

g(t-l)(n-l)g/3(c-x“-c-i/)t-a(t-l)  ^  ^ 

=  /{a  -0{c-x^  -c-y)-{n-  1)). 


For  the  mass  of  the  ray  inside  Kl  we  will  break  into  two  cases  depending  if 
(5{c  •  —  c  •  y)  is  >  2  or  not.  The  mass  of  the  ray  inside  Kl  is  at  least 


yi-\ 


dA 


p-\^pt(c.x--c-y) 


We  now  consider  the  two  cases. 
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Case  1:  0{c  -  —  c  •  y)  >  2:  An  integration  by  parts  gives  the  mass  of  the  ray 

inside  Kl  equals 

^ - K -  .I3(c.x--c.y)  _  _  1)  /■*  fi-2  mcx'^-cy) 

/3{c^  -  c-yy  ^  Vo  ^ 

>  — - ^ -  UPic-x'^-c-y)  _  (ft  _  1)  g(t-l)(n-2)  /3i(c.rr”-c.y) 

-  f3{c-x'^-c-yy  ^  Vo  ’ 

^  Xn^Pic^x^-cy) 

“  j3{c  •  x'^  —  c  -  yy  /?(c  •  a;“  —  c  •  y)  +  n  —  2  ' 

The  ratio  of  mass  inside  Kj,  to  outside  is  at  least 

(3{c  •  a;“  —  c  •  y)  —  1  a  —  (3{c  •  a;“  —  c  •  y)  —  n  +  1 
/?(c  •  a;“  —  c  •  y)  +  n  —  2  f3{c  •  x“  —  c  •  y) 

Since  l3(c  •  x“  —  c  •  y)  >2  and  a  >  3/3(c  •  x“  —  c  •  y)  +  3n  —  5,  the  ratio  is  at  least 

1. 

Case  2:  (3(c  •  x'^  —  c  •  y)  <  2:  The  mass  of  the  ray  inside  Kl  is  at  least 

r-Vt min  (l, 

yielding  a  ratio  of 

a  —  j3{c-  x'^  —  c  •  y)  —  {n  —  l)min^l,e^(‘^'^““‘^‘^)^ 

y^Qp(c-x'^-c-y) 

which,  by  our  assumption  on  (3,  is  at  least 

a  —  2  —  (n  —  1) 
ne^ 

and  since  a  >  n(e^  +  1)  +  1  this  is  at  least  1. 

□ 

Summarizing,  we  have  the  following: 
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Theorem  2  For  any  L>0,T>0  and  F  as  in  (2.1)  with  a,  ^  satisfying 

>  f 

a  >  max  (3^(c  •  a:“  —  L)  +  3n  —  5,  n(e^  +  1)  +  1^ 


then 


I  F>  (1/4)  /  F. 

J“  the  tip”  ^  JW 


2.5  Sampling  Procedure. 


We  now  show  how  to  approximately  sample  according  to  F  (=  First,  we 

discretize  the  rectangle  <  x  <  a;“.  Next,  we  find  an  approximation  for  F  in 
regions  not  in  Ki,.  Third,  we  devise  the  transition  matrix  of  a  Markov  chain  which 
realizes  the  desired  random  walk. 

Let  denote  the  vmit  vector  directed  in  the  ith  coordinate  direction.  Let  Cs{p) 
denote  the  cube  of  side  28  centered  at  p. 

We  will  divide  the  rectangle  x^  <  x  <  x^  into  small  cubes  of  side  28  where  8 
will  be  specified  later. 

U  =  ja:  e  R” 

J  M  ^(75(0:)  (2.5) 

X&J 


We  will  take  a  random  walk  on  the  graph  whose  vertices  are  the  set  U  of  centers 
of  cubes.  Also,  notice  that  even  though  there  may  hex  £  U  such  that  x'^  x^ 
do  have  x  +  (Jl  >  a:^  for  all  a;  G  U.  It  is  important  to  notice  that  the  l^o  diameter 


of  J  is  no  more  that 


x“  —  x^ 


+  48. 


Many  of  the  lemmas  require  that  8  not  be  too  large  with  respect  to  x'^ 
and  (3.  To  formalize  this  we  will  can  8  “fine”  if  we  have 


x^ ,  a 


-  ^  .  .minTx)'  -  x{)  1  . 

^  <  mm - .  , 

la  ly/npci 

and  we  will  often  invoke  the  following  result. 


(2.6) 


Proposition  \  If  8  is  ‘‘fine  ”  and  a,  /?  meet  the  conditions  of  Theorem  2,  then 
>  7((e^  +  1  )n  +  1 )  for  all  i. 
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2.5.1  Discretization  and  Approximate  Sampling  using  Member¬ 
ship  Oracle. 

Errors  due  to  two  sources  need  to  be  analyzed  here.  One  source  of  error  is  because 
we  discretize  the  region,  and  approximate  the  integral  of  F(x)  over  a  small  rectan¬ 
gle  by  J^(p)*(volume  of  the  rectangle),  where  p  is  the  center  of  the  rectangle.  The 
second  source  of  error  is  in  computing  the  gauge  function  for  points  outside  the 
feasible  region.  This  is  because  in  practice  we  may  only  be  able  to  calculate  F{p), 
an  approximation  for  F(p)  (using  the  membership  oracle  and  bisection  methods). 

In  this  section  (and  in  section  2.6)  we  will  need  for  every  p  eU  that  the  integral 
of  F  over  any  rectangle  C  centered  at  p  and  approximately  contained  in  Cs{p)  is 
well  approximated  hy  F{p)Vol{C)} 

More  precisely:  For  every  p  €.  J  v/e  need  to  determine  a  lower  bound  on  p 
such  that  for  some  continuous  monotone  decreasing  function  (  such  that  ^(0)  = 
and  all  >  0  in  some  open  neighborhood  of  0:  if  (7  is  any  rectangular  region  with 
center  p  contained  in  Cs+r){,p)  H  J  then 

We  will  also  need  a  lower  bound  on  a  such  that 

aF{x)  <  F{x  -f  \W)  (2.8) 

for  all  a:  €  Ff,  i  G  {1  •  •  •  n}  and  all  |A|  <  25. 

The  lemma  below  summarizes  the  errors  induced  here.  The  proof  is  in  sec¬ 
tion  2.8. 


Lemma  3  (2. 7)  and  (2.8)  hold  with 


O’  > 
P  > 


g-2a5/(mini(!cf-a;f))g-2/35||c||^ 


1  -|-  y/lTr/SS  ||c||2  erf 


(2.9) 

(2.10) 


*The  approximate  containment,  characterized  by  a  parameter  rj,  is  technical  point  used  only  to 
facilitate  the  proof  of  Lemma  4. 
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Let  F (a;)  be  an  approximation  for  F(a:)  calculated  using  only  x  and  the  mem¬ 
bership  oracle.  In  the  feasible  region,  F  =  F.lxi  the  infeasible  region,  F  may  not 
equal  F  because  the  dilation  cannot  be  computed  exactly.  Note  that  for  our  analy¬ 
sis  this  must  be  a  deterministic  approximation  and  not  one  obtained  by  sampling; 
to  be  clear,  we  always  calculate  the  same  value  for  F(a;).  This  is  calculated  to  a 
relative  accuracy  of  1  ±  (i.e.  •|yF(a;)  <  F{x)  <  ^F{x).) 

To  calculate  F{x)  to  a  relative  accuracy  of  1  / 1 1 ,  it  is  sufficient  to  calculate  the 
gauge  function  to  an  absolute  error  of  Xo  achieve  this,  it  is  sufficient  to 

calculate  the  distance  of  the  point  in  K  on  the  line  segment  from  x  to  a;“  that  is 

farthest  from  x^  to  an  absolute  accuracy  of  ))  _ 

very  quickly  by  bisection  search  using  our  membership  oracle. 


2.5.2  The  Markov  Chain. 


For  each  x  ^  U  let  N{x)  be  the  “neighborhood”  of  x  which  is  the  set  of  all 
vertices  in  U  that  differ  from  x  in  exactly  one  coordinate  by  ±28.  The  transition 
probabilities  P{x,  y)  will  be: 


P{x,y) 


'  imin(l,|M)  yeN{x) 

^  1-T.zsn{x)P{x,z)  x  =  y 
0  y  ^  N{x) 


(2.11) 


where  F  is  a  deterministic  estimate  for  F.  It  is  easy  to  see  that  P(x,  y)  induces 
a  time  reversible  irreducible  aperiodic  Markov  chain  with  steady  state  probabilities 
7r(-)  proportional  to  F(').  We  will  show,  in  the  next  section,  that  after  a  sufficient 
number  of  steps,  we  are  fairly  close  to  the  steady  state.  Let  tt  be  the  vmique  steady 
state  distribution  for  our  chain  (approximately  F(i,,t)).  But  first  we  show  that  in 
the  steady  state  there  is  a  reasonable  chance  of  observing  states  corresponding  to 
cubes  covering  the  near  optimal  feasible  portions  of  JTi,. 


Theorem  3  Let  n{-)  be  the  steady  state  probabilities  of  the  above  Markov  chain, 
then  if  a,  (3  satisfy  the  conditions  of  Theorem  2  and  8  is  “fine”  and  “the  tip”  is 
contained  in  J  then 

x&J,Cs(x)r[  “the  tip  ”#0 
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Proof:  Accovinting  for  the  errors  due  to  discretization  and  in  gauge  function 
computation,  7r(a;)  (=  DF(x))  satisfies 


-  A)'’  ((Af  L  ^  + A)'’-  ((Af  L  ■ 

.  -1 

where  D  =  (e^gc/  f)  . 

Now,  since  the  “tip”  probability  is  at  least  |  (by  construction  of  F),  we  have 
the  steady  state  probability  of  the  cubes  covering  the  tip  is: 


lA  7r(a;)  > 

!!;Gi7,C'^(;c)n“the  tip’Vfl 


1  X  (1  -  ^)p 

3  x(l  +  i)p-i  +  l  x(l-dj)p' 


S  is  fine,  so: 

aS  ^  1 
mini(a;^  -  a;/)  “  7 
and 

and  the  theorem  follows  from  inequality  (2.10).  □ 


(2.13) 

(2.14) 


2.6  Spectral  Gap  of  the  Markov  Chain. 

In  this  section,  we  determine  how  many  steps  are  necessary  for  the  Markov  chain 
to  “mix”.  We  need  to  find  a  relationship  between  the  number  of  steps  walked  and 
how  close  we  are  to  the  steady  state.  For  this  we  need  a  central  result  of  Sinclair 
and  Jerrum  [47].  We  also  need  several  well-know  facts  that  are  collected  in  [19]. 
Let  X  be  the  set  of  states  in  our  Markov  chain.  For  x,y  e  X,  let  P\x,  y)  be 
the  probability  that  starting  in  state  x  we  are  in  state  y  after  t  steps.  As  before, 
let  TT  be  the  unique  steady  state  distribution  for  our  chain  (approximately  S(£^x))- 
Recall  that  P{x,  y)  induces  a  time  reversible  irreducible  aperiodic  Markov  chain; 
however,  it  is  not  strongly  aperiodic  [47].  This  is  because  we  do  not  insist  that  we 
have  P{x,  a;)  >  |  for  all  x.  Because  of  this,  we  not  only  need  an  upper  bound 
for  the  second  largest  eigenvalue  but  also  need  a  lower  bound  on  the  smallest 
eigenvalue  [19]. 

We  will  use  Proposition  3  from  [19]  which  says: 
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- '^{y)\ < 

yex 


1  —  7r(a;) 
7r(a;) 


(X*)‘ 


where,  %*  =  niax(xi,  |Xm|)>  where  xi  is  the  second  largest  eigenvalue  and  Xm 
is  the  smallest  eigenvalue  of  P,  the  matrix  of  transition  probabilities. 

We  find  an  upper  bound  on  xi  by  appealing  to  a  result  of  Sinclair  and  Jerrum’s 
[47],  which  is  quoted  also  as  Proposition  6  of  [19],  namely,  Xi  <  1  ~  where 

A 

0  is  a  lower  bound  on  the  “conductance”  (defined  in  section  2.6.1)  of  the  Markov 
chain.  We  find  a  lower  bound  on  the  conductance  in  section  2.6.1  below.  A  lower 
bound  on  Xm  is  obtained  by  appealing  to  Proposition  2  of  [19],  described  in  detail 
in  section  2.6.2. 

We  also  prove  that  (see  sections  2.6.1-2.6.2)  \xm\  <  1  —  Thus,  we  have: 

E  \P\^^y)-'^iy)\  <  “7^(1  “y)‘- 

ysx  ^ 

What  we  wish  to  do  is  find  t  so  that  Yly^x  \P^{^iy)  —  '^(y)l  is  under  1/12. 
Recall  that  by  Theorem  3,  the  states  of  our  Markov  chain  corresponding  to  cubes 
that  cover  the  “tip”  have  mass  at  least  1/6.  Thus,  we  will  have  a  chance  of  at  least 
1/6  —  1/12  =  1/12  that  a  random  walk  of  i  steps  will  end  in  one  of  the  states 
corresponding  to  a  cube  covering  the  tip  (i.e.  close  to  the  optimum).  For  this,  it 
suffices  to  have 


t  >  In 


21n(12)  —  ln(7r(a;)) 

(2.15) 


We  now  wish  to  prove  a  lower  bound  on  7r(a;-^),  our  starting  point,  so  that  we 
can  apply  the  above  inequality. 

We  assume  that  the  walk  is  started  deterministicly  at  xP  Since  we  know  that 
at  least  half  of  the  mass  of  is  in  the  body  and  the  highest  possible  stationary 
probability  of  a  cube  intersecting  Ki,  is  at  most  and  there  are 


at  most  nr=i 


2S 


+  1 


states  mKi  (or  even  U),  we  know 


<  t  i\  P{cixf+2S\)-L)  TT 

p-112/112  l\ 


xt  -  x\ 

IS 


+  1 
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which  yields 


TT 


(x-^)  > 


^2^^/3(L-c-(xf+2ST)) 


i2n”=, 


fLzM  +  l 

2S  +  ^ 


but  we  will  just  use  the  easier  form: 

5^2g;0(i-c.(^/+25r)) 


12 


r  1 

2S 


+  1 


Plugging  in,  we  get 


(2.16) 


Theorem  4  If  the  conditions  of  Theorem  3  are  met  then  running  the  above  Markov 
chain  for  at  least 


2 ln(12)  +  In  (^)  +  (3{c  ■  {x^  +  251)  -L)  +  nln 


(2.17) 


steps  is  sufficient  to  ensure  with  probability  at  least  the  chain  will  stop  at  a  state 

X  such  that  x  5\  is  feasible  and  within  cost  T  +  25  \\c\\^  of  the  optimal  point. 


Although  at  the  end  of  Section  2.4,  we  had  shown  that  the  steady  state  prob¬ 
ability  of  being  in  the  tip  is  at  least  f ,  now  we  only  guareintee  the  probability 
that  the  sample  is  in  the  tip  at  the  end  of  t  steps  is  (at  least)  Thus,  the  three 
sources  of  errors-discretization,  approximation  of  the  gauge  function,  walking  for 
t  steps-reduce  the  tip  probability  from  ^  to 


2.6.1  Conductance. 

Take  aribtrary  V  CU  and  V  =  U  \  V.  We  define  the  conductance  of  V  by 

,  _  J2x£V,y^vnN{x)  '^{x)P{x,  y)  _  Ylxev,y€Vr\N(x)  min(F(a;),  F{y)) 

min(7r(y),7r(y))  “  2nmin(f  (V),  F(y)) 

The  conductance  of  the  chain  defined  by 


6  =  min  4>v- 
^  vcu 


(2.19) 
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We  use  an  “isoperimetric  inequality”  to  find  a  lower  bound  for  (f).  An  isoperi- 
metric  inequality  was  proved  in  [23];  a  simpler  proof  of  a  stronger  inequality  was 
given  in  [40].  The  inequality  was  generalized  to  the  case  of  log-concave  functions 
in  [5].  We  use  here  a  version  of  this  from  [21],  If  dist(a;,y)  =  ||a;  —  y||  where 
1 1  •  1 1  is  an  arbitrary  norm  on  R"  and  diam(ii")  =  max^., y^K  dist(a;,  y)  we  have  the 
following  theorem  from  [21]: 

Theorem  5  Let  J  CR^  be  a  convex  body  and  F  a  log-concave  function  defined 
on  int  J  and  y,  the  induced  measure.  Let  5'i,  5*2  C  J,  and  t  <  dist(S'i,  and 
d  >  diam(  J).  IfB^J\{SxU  S2),  then 

min(//(5i),^(5'2))  <  ^(d/t))u(5).  (2.20) 

We  will  use  dist{x,y)  =  ||a;  —  j/||^. 

Lemma  4  IfS  is  fine  then  4>  >  ,  n  f  i'ri . =  (j)  (say). 

jTl  M  iC  ~X  1 1 
M  M  00 

Proof:  For  any  V  CU  and  V  —  U\V,  and 

Vs  =  [j  Csix) 

x&V 

Vs  =  [j  Cs{x) 
xev 

Let  ?7  be  a  small  positive  real  that  will  tend  to  zero.  Let  be  the  77/2  neighborhood 
of  n  Vi  and  Bs{x,  y)  be  the  77/2  neighborhood  of  Cs{x)  D  Cs{y).  Let  81,82  and 
BheVsX  Bs,Vs\  Bs  and  Bsf\J  respectively. 

From  inequalities  (2.7),  (2.8),  (2.9),  (2.10)  and  (2.12),  it  is  clear  that 

^  min(F(x),  F(y))  > 

x^V,yQVr\N(x) 

> 

> 


10 

11 


^  min(F(a:),F’(y)) 


—a  V  +  y 

11  ^ 

-77  V  1  f  p 
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> 

> 


10 


ap 


=r/. 


1 1  r}{2S  +  r]Y~^  Jbs 
10  <7/9 


11  r}{28  +  r}Y~^  Jb 

similarly  mm(F(V),  F(y))  <  ^-^min  (/^  F,/.  F 


< 


UapilSY 
12  1 
11  (Tp{25  —  viY 


Vi 
mm 


’Vi 


From  the  isoperimetric  inequality, 

IbF 


> 


2t] 


^MfsiFJs^F)  11®“ -x' 


+  45' 


^—n 


M  +4sy 

Since  5  is  fine  (from  inequalities  (2. 10), (2. 9)) 


0> 


3n  ||®“  —  xH 


A 

=  <!>. 


Combining  this  with  the  inequalities  above  and  taking  the  limit  r/  ^  Owe  get: 

2^(7^pi^5 


(2.21) 


(2.22) 


□ 


2.6.2  Comparison  ofxi  and  \xm\ 

Here  we  find  a  lower  bound  for  Xm,  by  the  canonical  odd  path  argument  outlined 
in  Proposition  2  of  [19]. 

Let 

A  =  TT-n - (2.23) 

8n  ||3;“  —  ®  ||<5o 

For  each  state  x  let  be  the  smallest  non-negative  integer  such  that  P{x  + 

2iOx5E\x  +  2u)xSEY  >  A  (this  is  always  possible  since  P(a,  a)  >  ^  >  A  on 
the  border  of  our  bounding  region).  Let  be  the  2a;a;  +  1  step  path  of  from  x  to 
x  given  by: 

“self  loop” 

^  I — ^  X  ~f-  25E^  I — ^  X  +  45 E^  i — ¥  •  •  •  ®  -I-  2u)x5E^  \ — ^  x  +  2ol)x5 E^  i — y  •  •  ■  x  ~t~  5E^  i — h  x  25 E  h 
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We  will  call  “the  canonical  odd  path  for  x'\ 

Proposition  6  of  [19]  states  for  any  selection  of  canonical  odd  paths  we  have 
Xm  >  -1  +  ^  where 


^  Ikxl|p7r(a;), 

<T.3(a,6) 

where  ||(T^||p  =  ^  ^  . 


(2.24) 

(2.25) 


To  prove  the  bound  for  x*,  we  will  show  that  |xm  |  is  less  than  the  upper  bound 
for  xi-  From  the  discussion  above,  it  is  sufficient  to  show  the  following. 

Lemma  5  c  <  4r 

—  0-; 

Proof:  By  our  choice  of  paths,  we  have  for  any  i  <  a;a,  —  1 : 

^-A  <  P{x  +  i25E\x  +  {i  +  \)2SE^)  <  ^ 

Plx  +  {i  +  l)2SE\x  +  i2SE^)  <  £ 

and  by  time  reversibility 

P{x  +  i2SE\x  +  {i  +  l)2SE^) 


'K{x-\-{i-\-\)25E)  = 


P(x  +  (i  +  \)2SE\x  + 


For  any  x  we  have. 


^X\\p 


<  2E 


1 


+ 


1 


^  7r(z)(l  —  2An)*(^  —  A)  7r(a;)(l  —  2An)^^A 

j’ 

^+1 

<2  E  — -  •  s./  1 - TV  + 


1 


i=\  7r(a;)(l -2An)*(^-A) 


7r(a;)A(l  —  2An) 


—x\ 

+* 


<  2 


X  2  —  X  2 


26 


+  1 


2n 


/ 

\  +  / 

7r(a:)  f  1  — 

fizA  +  2 

25  ^  ^ 

2Anj  ff(a;)Afl  — 

25  ^  ^ 

2An 


Using  Proposition  1  it  is  easy  to  show  that 


—  X\ 


25 


+  2 


2An  <  -. 
“  3 
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Continuing, 


p  ^ 


_  I 

0-1  O  1 


25 


+  1 


6n  3 

+ 


Because  each  edge  can  be  used  by  at  most 
we  have 


7r(a;)  27r(a;)A' 

canonical  odd  paths, 


_ 11  I  1 


25 


L  < 


O  1  O  1 


28 


+  1 


o  1  o  1 


28 


+  1 


^  3  ^ 

6n  +  — 
2A, 


(2.26) 


The  result  now  follows  from  Lemma  4,  inequality  (2.26)  and  Proposition  1 .  □ 


2.7  Description  and  Analysis  of  the  Algorithm. 

Suppose  we  are  given  a  feasible  point  x^,  a  relative  accuracy  goal  (e)  and  a  desired 
upper  bound  on  the  probability  that  the  algorithm  fails  (k).  Let  u  be  the  ratio  of 
c-  to  c-  x°^\  We  assume  that  without  loss  of  generality,  the  problem  has  been 
rescaled  so  c*-  =  1  for  all  i  (xi  — )■  CiXi,  Ci  =  1). 

Cl 


2.7.1  In  the  worst  case 

Here,  we  assume  that  n  >2  and  e  <  1 .  We  present  AlgorithmA. 

The  analysis  of  AlgorithmA  is  fairly  straight  forward. 

•  It  is  easy  to  see  that  such  that  xf  =  must  dominate  any  optimal 

point. 

•  Each  time  we  draw  a  sample  according  to  Theorem  (4)  we  have  at  least  a 
chance  of  that  it  is  feasible  and  within  a  cost  of  T  +  2nS  of  the  optimum. 

V  xf-L 

We  have  picked  T  and  8  such  that  T  +  2n8  <  — ^ — . 

•  When  the  repeat  loop  terminates  either 
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Input:  x^,  e,  k,  c  and  a  membership  oracle  for  K. 
rescale  problem  so  c  =  1 
a;'  ■(-  0 
L  4-  0 


^  x^ 


while x(  —  L>  eL 

4-  2  Ej  ,  z  =  1 
y.x{-L 

^3  3 


(point- wise  lower  bound) 
(probalistic  cost  lower  bound) 

(the  best  feasible  solution  observed) 


X 


n 


Tf- 
/?f-f 

T 


a  4— 


(point- wise  upper  bound) 

( I  of  the  current  gap) 

(objective  function  “bias”) 

(gauge  function  gain) 

(step  size) 

times  or  until  ^ 

run  the  random  walk  of  section  2.5  with  the  above  parameters  for  the  number  of 
steps  prescribed  in  Theorem  4,  let  a;  be  the  stopping  point  of  the  walk, 
if  X  -t-  51  feasible  and  I],(xj  -|-  5)  < 
then  x'’®®‘  4—  X  +  51 

endrepeat 


repeat 


10gl2 

11 


^f'°g2(Dl+‘ 


ifEia:^^*> 


J 


then L  4-  E* a:r‘  -T -2n5 


„best 


endwhile 
return  x^ 


Figure  2.4:  AlgorithmA. 
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-  enough  samples  have  been  drawn  such  that  with  confidence  at  least 

I 

(i_«)Ka)i+‘ 

one  of  them  was  feasible  and  within  distance  T + 2n8 

of  the  optimum. 

Either  way,  the  distance  from  the  new  to  the  new  L  is  no  more  than  half 
of  the  old  distance. 


•  Thus,  the  outer  while  loop  will  run  no  more  than  [logj  ]  times. 

•  Furthermore,  we  see  the  first  time  the  algorithm  alters  the  lower  bound  (it 

must  establish  a  lower  bound  to  halt),  we  have  L  >  -T-2nS> 

-  y- 

7  *^2  * 


-  Therefore,  the  outer  while  loop  will  cycle  no  more  than  |^log2  Q) 
more  times. 

-  Thus  the  lower  bound  can  be  altered  at  most  [^log2  (7)  +1  times. 

-  Since  each  lower  bound  alteration  is  correct  with  chance  at  least  (1  - 

1 

«;)  ['‘’82(1)]+!  ggg  ijjgy  gii  gj-g  coiTect  wlth  odds  at  least  \  —  k. 


Thus  the  algorithm  fails  with  chance  less  than  k. 

We  must  check  that  a,  (3  and  5  were  picked  correctly. 


•  (3  clearly  satisfies  inequality  (2.2). 

•  Assuming  that  n  >  2,  we  see  that  a  satisfies  inequality  (2.3). 

•  It  is  easy  to  see  that  inequality  (2.6)  is  satisfied. 

•  Invoking  Lemma  (4),  ^ 

(6174)  V  (7+3.2n+n  In  [  ] 

•  Sot  =  - - ^2 — - —  steps  are  enough  to  draw  a  sample. 

Thus,  each  sample  can  be  drawn  in  O  steps. 

•  Each  step  requires  at  most  0  (log  membership  queries  to  compute  the 
gauge  function. 
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So  the  total  number  of  membership  queries  is 


O 


(log  (7))^og  log 


{221) 


which,  if  we  ignore  lesser  log  factors,  can  be  thought  of  as 


log(i/)log(^) 


It  is  quite  natural  to  wonder  if  an  algorithm  with  this  poor  a  runtime  can 
possibly  be  an  improvement  on  the  ellipsoid  algorithm.  In  [33]  the  version  of 
the  Yudin-  Nemirovskiii  separation  from  membership  algorithm  requires  0~(n^) 
membership  tests  to  build  a  single  approximate  separator  and  then  the  shallow 
cut  ellipsoid  algorithm  may  need  separators  to  optimize.  It  must  be 

emphasized  that  this  reference  in  only  concerned  with  proving  the  problem  is  in 
P,  so  it  is  unlikely  that  0~(n^®)  membership  queries  is  the  best  known.  Finally  it 
should  be  emphasized  that  the  ellipsoid  algorithm  doesn’t  have  a  super  logarithmic 
dependence  on  e. 

Remark:  1  With  a  new  result  of  Frieze,  Kannan  and  Poison  that  the  algorithm 
outlined  here  can  have  it ’s  dependence  on  n  brought  down  to  rf  by  performing  the 
sampling  walk  in  a  bounding  sphere  instead  of  a  bounding  box  and  estimating  x* 
without  introducing  the  idea  of  conductance.  Other  methods  such  as  using  newer 
chains  available  from  Lovasz  and  Simonovits,  or  Kannan,  Lovasz  and  Simonovits 
allow  running  times  of  O(n^)  and  possibly  0{rt?).  None  of  these  improvements 
are  capable  of  replacing  the  dependence  on  e  with  one  on  log(e) 


2.7.2  With  advantageous  bounds. 

Here  we  analyze  the  situation  where  good  bounds  L  and  are  known  and 
attempt  to  lower  the  dependence  of  the  runtime  on  n.  To  do  this  meaningfully,  all 
dot  products  (and  norms  other  than  infinity)  must  be  removed  from  the  expressions 
as  they  hide  n’s.  In  this  subsection,  we  work  out  a  bound  on  run  time  that  explicitly 
shows  all  of  the  powers  of  n. 

We  still  assume  that  the  problem  has  been  rescaled  so  Ci  =  lforalH(a:i  CiXi, 
c,  ^  =  1)  and  that 

minj(a;y  -  x{)  ^  ]_ 

|a;«||^ -minta;|-  “  2 
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(2.28) 


and  min(a;“)  +  min(a;|)  >  2 


This  is  easy  to  ensure  by  replacing  with  +Ik1 

II  II  oo 

interpretation  of  making  the  problem  “well  rounded”. 

'1®^" 


(2.29) 

and  has  the  geometric 


We  notice  that  if  e  %■  —  1  then  is  already  a  solution  of  the  desired 

min<  ajJ  •' 

accuracy.  This  and  inequality  (2.29)  imply 


e  mm;  x] 


(2.30) 


Rather  than  the  adaptive  approach,  first  consider  a  sampling  algorithm  that 
comes  within  e  of  optimal  in  one  long  walk: 


•  We  set  L  = 


and  T  =  e 


X 


We  will  chose  (3  to  be  at  least  ^  instead  of  as  stated  in  inequality  (2.2). 
The  is  required  to  guarantee  that  we  get  within  T  of  the  optimum  instead 
of  the  T  +  2<5  ||c||i  we  could  expect  because  of  discretization.  To  guarantee 
this,  we  must  show  that  ^  min;  x\  >  2n5. 


-  So  we  set 


f3  = 


11 


10emin;a;; 


and  O’  = 


€  mm;  xl 


(2.31) 

(2.32) 


By  inequality  (2.30),  this  satisfies  inequality  (2.3). 
Now  setting 


e  mm;  x\ 


satisfies  inequality  (2.6),  by  inequality  (2.28). 
-  Clearly,  we  have  ln5  <  -^e  min;  x[. 

So  the  26  ||cl| ,  factor  has  been  dealt  with. 


(2.33) 
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Now,  by  Lemma  (4)  and  Theorem  (4),  we  have 


e  mm;  x\ 


210n2  ||a;“  — 


t  = 


7  + 


lln 


X 


OO 


lOemin,  x 


+  nln 


70n 


x^  —  x^ 


2e  mini  a;  • 


+  1 


^210n^ 


x^  —  x^ 


emmi  x 


(2.34) 


Which,  if  we  take  the  middle  term  of  the  sum  to  be  dominant,  is 


/ 


0 


n 


x^  -x‘ 


e  mm*-  x 


(2.36) 


steps. 


We  will  call  this  algorithm  AlgorithmB.  AlgorithmB  can  then  be  repeated 


ln(K) 

ln(ll/12) 


times  to  amplify  the  chance  of  success  to  at  least  1  —  k. 


The  dependence  on 


can  be  improved  by  designing  a  new  algorithm 


(AlgorithmC)  that  runs  AlgorithmB  in  stages  like  we  did  for  AlgorithmA. 
The  analysis  is  as  before  and  the  run  time  comes  out  to 


/ 


O 


n 


x^-x‘ 


In 


n 


x^  -x' 


V 


cmmiXl 


emmiXi 


ln(i) 

K 


(2.37) 


membership  queries. 


2.8  Errors  Due  to  Discretization 

For  every  p  G  J  we  need  to  determine  a  lower  bound  on  p  such  that  for  some 
continuous  monotone  decreasing  function  (  such  that  i^(0)  =  0  and  all  p  >  0 
in  some  open  neighborhood  of  0:  if  (7  is  any  rectangular  region  with  center  p 
contained  in  Cs^r]{p)  H  J  then 

(vJm  i  (vm  L  ■ 

(2.38) 
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We  will  also  need  a  lower  bound  on  a  such  that 


crF{x)  <  F{x  +  A£J®) 


(2.39) 


for  all  a;  6  i  €  {1  •  •  •  n}  and  all  |A|  <  25. 

Forvectorsaand61eta6bethevector  (a6)i  =  Uihi.  To  facilitate  the  analysis  we 
break  F  into  its  two  constituent  parts  /and  g  where  f{x)  = 
g{x)  =  and  F{x)  =  f{x)g{x). 

cr  is  easy  to  deal  with  when  we  apply  the  well  known  fact  that  a  gauge  function 
based  on  K  can  fall  no  faster  than  one  based  on  a  convex  subset  of  K  (containing 
a;“).  To  be  precise  we  apply  Corollary  52  from  [5]  to  get: 


\'fp(KL,x'^){x)  -  '>P(KL,x'‘){y)\  < 


Ik-ylL 

mini(a;f  -  x{) 


which  implies 

and  inequality  2.39  is  satisfied  with 


(y  =  g-2a^/(mini(®f-®/))g-2/95||c||^_ 


(2.40) 

(2.41) 


To  get  a  lower  boimd  on  p  we  use  the  simple  rule  that  for  /,5'>  0 
mn(/(x))/s,<//s<mj«(/)/s, 

and  use  inequality  2.40  to  get  the  point- wise  bounds  on  /.  All  that  remains  is  to 
derive  a  lower  bound  p'  such  that 


We  note  that  g  is  of  the  form  g{x)  —  h{c-x)  for  some  non-negative  convex  function 
h  in  the  region  we  are  interested  in  and  since  C  is  symmetric  about  p  we  know 
that  (from  Jensen’s  inequality) 

and  any  <  1  satisfies  the  right  side  of  inequality  2.42. 


46 


To  get  the  left  side  of  inequality  2.42  we  define 


Si  =  inax{A  €  R  I  p  +  AF  €  C} 

D  {a:  G  R"  |  |a;i  —  Cipi\  <  CiSi  Vz} 
and  change  variables  to  get: 

riiLi  2ci5i  Id^' 

Let  p(  A)  be  the  measure  of  the  set 

£  D  a;-l— p-c>A| 

It  is  easy  to  see  p  is  differentiable  and  — p'(A)  >  0  for  A  G  [0,  c  •  S].  We  return  to 
our  integral  (using  the  estimate  p(A)  —  p(A  +  d\)  =  — p'(A)dA): 

=  „  '  -  r  -».'(A)  d  A 

11«=1  -'0 

=  g(p)  ^  /  cosh(^A)p'(A)dA 

iU=l 

^  ^2c  'E'  I  sinh(^A)p(A)  d  A^ 

By  Theorem  2  of  [36],  we  have  p(A)  <  (FliLi  2ciHi)  So  continuing 

we  have: 

X  «  -  ('  +  r  sinhCA)*-^’/*^"-^"?'  d  a) 

We  need  an  upper  bound  for 

2(i  r^sinh(A)e-^'/PII“ll')dA 
Jo 


47 


which  comes  out  to  (by  standard  integral  tables) 


which  for  sufficiently  small  rj  >  0  (and  r)  =  0)  is 


<  +0(1)) 

combining  this  with  our  point-wise  bound  on  /  we  get 

g-or5/(mini(a;J‘-a;f)) 

^  1  +  \/l^l35  l|c||2  erf  e^^^^lldll/2 

2.9  Relation  to  Simulated  Annealing 

The  above  algorithm  has  some  similarities  with  simulated  annealing.  [45]  In  simu¬ 
lated  annealing  optimization  is  performed  by  moving  from  feasible  state  to  feasible 
state  by  choosing  moves  from  small  neighborhoods.  The  moves  are  selected  and 
rejected  in  a  manner  almost  identical  to  the  Metropolis  filter  as  applied  to  our 
function.  The  fact  that  our  algorithm  steps  through  infeasible  states  is  not  very  im¬ 
portant.  This  it  is  an  artifice  introduced  to  improve  the  mixing  rate  of  the  Markov 
chain  by  removing  small  sets  from  consideration.  The  constants  in  any  simulated 
annealing  problem  can  also  be  hidden  in  the  objective  function  (similar  to  La- 
grangian  relaxation).  The  second  point  of  similarity  is  that  where  our  algorithm 
moves  its  upper  bound  (by  finding  a  sufficiently  better  new  best  point)  or  lower 
bound  (by  building  a  probalistic  proof)  is  exactly  like  altering  the  “temperature” 
parameter  in  the  Boltzmann  equation  that  selects  moves  in  simulated  annealing. 
In  fact,  though  we  have  not  found  any  advantage  to  it,  we  could  redesign  our  algo¬ 
rithm  to  alter  (3  and  7  continuously  to  imitate  simulated  annealing.  Unfortunately 
this  would  mean  that  the  chain  is  no  longer  stationary  (as  the  chain’s  transition 


5-h^||c5||2 

y/l  ||c£l||2 
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behavior  would  change  over  time).  One  could  make  crude  arguments  that  the 
chains  behavior  over  any  time  period  would  be  no  worse  than  if  it  had  been  run 
with  the  parameters  at  the  end  of  the  time  period.  Then  one  could  appeal  to  a 
Markov  chain  central  limit  theorem  to  show  that  the  “tip”  would  still  be  visited 
with  high  frequency. 

It  is  a  very  interesting  open  problem  to  try  and  extend  the  theory  of  rapidly 
mixing  Markov  chains  to  the  stochastic  processes  like  simulated  annealing. 


2.10  Small  Sets 

The  treatment  of  rapidly  mixing  Markov  chains  given  here  completely  avoided 
the  “small  set”  issue  by  allowing  the  chain  to  move  to  infeasible  states  and  then 
applying  a  punishment  factor  to  return  the  chain  to  the  feasible  region.  This  had  the 
disadvantage  of  introducing  a  hard  to  characterize  gauge  style  function  depending 
on  a.  This  forced  us  to  make  S  very  small  so  we  could  apply  crude  point-wise 
bounds  on  this  gauge  function,  whereas  for  the  bias  function  we  were  able  to 
make  more  sophisticated  estimates  of  variation  over  regions  (via  the  Hoeffding 
inequality)  that  would  have  allowed  <5  to  be  a  factor  of  ^/n  larger  and  shaved  a 
factor  of  n  off  the  miming  time  of  the  Markov  chain. 

The  small  set  problem  is  that  if  we  have  a  Markov  chain  on  a  set  of  points 
chosen  from  a  convex  body  that  conductance  is  poor  near  the  boundary  of  the  body. 
In  fact  if  the  chain  uses  only  states  that  are  such  convex  bodies  it  may  not  even 
be  connected.  An  example  is  K  =  |(x,  y)  €  y  >  0,  y  <  |a;,  y  >  3a:  —  3 1. 

=  {(0,0),  (1,0),  (1, 1),  (2, 3)}  which  isn’t  connected  by  the  King’s  moves 
that  changes  a  single  coordinate  by  ±1.  We  are  certainly  not  going  to  be  able  to 
get  a  strong  conductance  result  for  this  chain  as  the  set  {(2,3)}  is  inescapable. 
There  are  a  number  of  ways  to  deal  with  the  above  problem.  The  one  we  used  is 
to  walk  on  a  large  region  of  Z’^  instead  of  Z”  fi  K  and  try  to  force  the  chain  into 
K  with  a  penalty  function.  This  has  the  advantage  of  simplifying  the  stmcture  of 
the  chain  and  the  disadvantage  of  introducing  a  hard  to  analyze  penalty  function. 

Lovasz  and  Simonovits  [40]  dealt  with  this  problem  by  introducing  a  notion 
called  s-conductance.  The  s-conductance  of  a  Markov  chain  is  not  the  worst  escape 
probability  of  any  set  of  states  in  the  chain  but  the  worst  escape  probability  of  any 
large  set  of  states  in  the  chain.  The  authors  then  perform  a  fairly  delicate  analysis 
to  show  that  analogues  of  the  convergence  theorems  known  for  conductance  are 
true  for  s-conductance.  Nothing  comes  for  free  so  some  care  must  be  made  that 
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the  chain  does  not  start  in  a  small  set.  In  addition  to  the  improved  mixing  rate 
of  the  Markov  chain  described  in  this  paper  this  paper  is  significant  because  it 
formalizes  many  of  the  geometric  methods  used  in  this  field  into  well  contained 
lemmas  (such  as  “localization”). 


Frieze,  Kannan  and  Poison  [28]  deal  with  the  problem  in  a  similar  manner. 
They  use  a  strong  minimax  characterization  of  the  eigenvalues  of  the  linear  op¬ 
erator  representing  the  Markov  chain.  In  this  characterization  the  convergence 
rate  of  the  Markov  chain  basically  the  gap  between  the  eigenvalue  1  and  the  next 
largest  eigenvalue  (the  separation  of  eigenvalues  from  -1  is  usually  easy  to  show). 
By  the  minimax  characterization  the  second  largest  eigenvalue  is  the  eigenvalue 
corresponding  the  vector  (f>  the  maximizes  the  Rayleigh  quotient  (j)^P4>/(j)^(f)  sub¬ 
ject  to  (jF-K  =  0  where  tt  is  the  stationary  distribution.  It  turns  out  that  similar 
convergence  theorems  can  be  proven  for  the  vector  that  maximizes  this  quotient 
subject  to  the  additional  restrictions  that  it  places  no  mass  in  any  of  the  states  near 
the  boimdary  of  K.  The  restrictions  are  easily  expressed  as  linear  relations  and 
lead  to  an  analysis  that  has  similar  consequences  to  the  s-conductance  arguments. 
Again  the  restricted  result  is  weaker  that  the  classical  result  and  care  must  be 
taken  that  one  does  not  start  near  any  of  the  bad  states.  The  direct  handling  of 
eigenvalues  and  eigenvectors,  without  explicit  mention  of  conductance,  are  likely 
to  yield  many  more  results  in  the  near  future. 


A  last  concept  that  is  in  development  by  Kannan,  Lovasz  emd  Simonovits 
is  average  local  conductance.  Local  conductance,  introduced  by  Lovasz  and 
Simonovits,  is  the  chance  that  Xt  4^  Af+i  or  that  the  chain  makes  a  non¬ 
trivial  transition.  Average  local  conductance  is  the  expected  value  of  P[Xt  4 
Xf+i  I  At  is  TT  distributed] .  Most  current  analysis  of  Markov  chains  state  how  many 
steps  the  chain  must  nm  for  some  property  to  hold.  Average  local  conductance  re¬ 
sults  are  theorems  of  the  form  “after  so  many  non-trivial  transitions  some  property 
holds.”  These  results  are  much  more  powerful  as  they  not  only  allow  one  much 
more  freedom  in  trying  to  estimate  how  bad  states  affect  a  Markov  chain  (one  can 
show  there  are  not  many  of  them,  or  one  is  unlikely  to  visit  them)  but  one  also  has 
the  option  of  proving  nothing  about  average  local  conductance  and  running  chains 
until  one  observers  empirically  that  the  requisite  number  of  non-trivial  transitions 
have  been  taken. 
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2.11  Stopping  time. 


Our  actual  implementation,  while  not  practical  for  general  use,  allowed  us  to 
empirically  observe  an  expected  behavior  of  the  Markov  chain.  Since  all  our 
proofs  of  stopping  time  must,  to  be  correct,  be  pessimistic.  The  chain  will  improve 
its  “best”  point  much  faster  that  the  analysis  indicates  and  it  is  worth  the  extra 
effort  to  continuously  monitor  the  chain  and  restart  it  at  a  new  best  point  when 
any  significant  fraction  of  improvement  is  noticed  in  objective  value.  With  this 
modification  the  chain  very  quickly  restricts  to  a  small  neighborhood  of  the  true 
optimal  point  and  then  only  has  to  run  to  the  theoretical  stopping  time  one  set  of 
times:  to  prove  the  lower  bound.  In  practice  almost  all  of  the  running  time  of  this 
algorithm  is  used  in  proving  the  final  lower  bound. 

2.12  King’s  move  data  structures. 

Using  the  King’s  move  has  a  number  of  implementation  advantages  over  using  a 
continuous  move  structure.  First  many  fewer  random  bits  are  needed  as  we  only 
need  log(n)  coin  flips  to  decide  which  direction  to  go  and  some  constant  number 
of  coin  flips  to  decide  whether  to  go  or  not  (as  we  have  restricted  F  to  vary  by  no 
more  than  a  constant).  We  can  also  update  our  state  in  constant  time  (change  one 
entry  in  a  vector).  We  have  also  found  that  membership  in  K  and  be  checked  faster 
than  is  the  case  for  more  general  moves.  If  K  is  given  by  m  linear  inequalities  then 
by  incrementally  updating  all  of  the  linear  relations  we  can  check  membership  in 
0{m)  time  instead  of  0{mn). 

We  have  tried  building  a  membership  oracle  for  the  component  commonality 
problem  by  simulating  the  numerical  integrations  required  to  test  membership 
in  K  by  using  sums  over  historic  data.  K  is  defined  as  the  set  of  a;  G  R™ 
such  that  F{y)  >  7  where  F  is  the  probability  density  function 

for  the  month’s  total  orders.  We  approximate  K  with  the  non-convex  body 
Ks  =  {a;  e  R™  7  E|=i,  Ayi<x^  ^  7  }  where  yi ,  •  •  • ,  are  points  drawn  from 
R”  according  to  tne  density  F.  We  are  using  the  standard  statistical  trick  of  using 
the  “empirical  distribution”  derived  from  previous  observations  to  simulate  the 
true  distribution.  We  know  by  central  limit  arguments  that  lim5_^co  Kg  =  K.  We, 
of  course,  have  the  problem  that  the  empirical  distribution  is  always  atomic  and 
Kg  is  almost  never  convex.  It  was  shown  by  Applegate  and  Kannan  [5]  that  small 
departures  from  true  convexity  do  not  overwhelm  the  Markov  chain  technique. 
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Notice  that  by  organizing  multiple  copies  of  j/i,  •  •  • ,  j/*  each  copy  sorted  by  a 
different  coordinate  we  can  compute  membership  in  Ks  in  time  0{s)  instead  of 
0{ms). 

These  savings  are  very  important  since  the  Markov  chain  we  use  the  mem¬ 
bership  oracle  every  step  and  the  membership  oracle  performs,  by  far,  the  most 
expensive  operation  each  step. 
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Chapter  3 

Contingency  Tables:  Exact  and 
Approximate  Counting 


3.1  Background. 

Consider  an  m  by  n  matrix  of  non-negative  integers.  This  matrix,  called  a  contin¬ 
gency  table  is  an  important  summary  tool  in  statistics  and  arises  in  the  following 
maimer.  Consider  a  finite  multi-set*  with  elements  {i,j)  {a  G  [1  •  •  •  m],  b  G 
[1  •  •  •  n]).  For  each  such  {i,j)  let  nij  denote  the  probability  that  an  element  drawn 
out  of  the  multi-set  with  uniform  probability  is  equal  to  (i,i).  If  such  a  multi-set 
represented  a  population  of  people  that  are  identified  only  by  hair  color  (Black, 
Brunette,  Red,  Blonde)  and  eye  color  (Brown,  Blue,  Hazel,  Green)  then  TiBiack, Green 
would  be  the  probability  that  a  person  picked  from  this  population  uniformly  at 
random  would  have  both  black  hair  and  green  eyes.  A  possible  experiment  would 
be  to  draw  a  multi-set  sample,  S,  of  T  elements  out  of  the  original  multi-set  (for 
simplicity  assume  we  draw  with  replacement).  We  could  then  summarize  this 
experiment  in  an  m  by  n  matrix  X  where 

Xij  the  number  of  occurrences  of  (i ,  j)  in  5.  (3.1) 

Define  tt^,*  =  7r*j  =  Y1T=\  consider  the  distribution  to 

be  independent  if  TTjj  =  Of  immediate  statistical  interest  is  testing 

if  the  original  population  is  independent,  given  an  observed  sample  X.  As  is  often 

'a  set  the  allows  repetitions. 
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the  case  in  statistics  we  can  not,  without  additional  information,  in  the  form  of 
priors,  determine  if  the  “Null  Hypothesis”  H\  :  tt^j  =  7ri,*7r*,j  Vi,  j  is  likely  to  be 
true  or  false.  Instead  we  calculate  a  significance  or  how  likely  X's  deviation  from 
independence  would  be  if  Hi  were  true.  If  X  is  too  far  from  independent  (Xj  j  too 
far  from  ^  / T)  then  we  may  suspect  that  the  original  population  was  not  Hi 

distributed. 


3.2  Statistical  hypothesis  testing. 

The  obvious  method  for  testing  if  X  is  typical,  assuming  Hi ,  would  be  to  use  the 
Xij/T  as  observed  estimates  for  the  iTij  and  then  to  check  if  the  independence 
relations  approximately  hold.  This  requires  estimating  (n  —  1 )  (m  —  1 )  parameters 
from  the  experiment.  A  slightly  more  subtle  approach  is  to  let  r*  =  X,,*  and 
Cj  =  X*  j  and  look  at  the,  unknown,  distribution  restricted  to  Xs  such  that  K',*  =  r, 
and  =  Cj.  The  density  function  for  any  population  obeying  Hi  restricted  to 
the  row  and  column  conditions  is  independent  of  all  of  the,  unknown,  irijs.  If  fact 
P(F|r,  c)  (the  probability  of  observing  the  table  non-negative  integral  table  Y  that 
has  row/column  totals  matching  r  and  c  is  the  so  called  Fisher/Yates  distribution: 


Thus  for  two  different  distributions  tt*  ,  tt^  on  the  original  population  that  obey  the 
independence  hypothesis  Hi  we  have  for  any  two  tables  X,  Y  such  that  X,-,*  = 
and  X* j  =  y* j  we  have  (X)/P,.  (Y)  =  P,2  (X)/P,2  (Y )  even  though  P,.  (X) 

may  be  very  different  from  P^2{X). 

Because  of  this  observation  it  is  considered  good  practice  (c.f.  [16])  to  perform 
the  desired  significance  tests  in  the  subset  of  contingency  tables  that  have  Y  ,*  =  r* 
andY.j  =  Cj.  Given  this  restriction  we  see  that  measuring  the  deviation  of  X  from 
independence  is  equivalent  to  measuring  the  deviation  of  X  from  the  intersection 
of  the  independence  surface  and  the  subspace  of  tables  that  obey  the  row  and 
column  restrictions.  The  surface  and  restriction  subspace  intersect  at  exactly  one 
point  y  such  that  Yj  =  Xj,*X*,j/r  so  we  have  again  reduced  the  hypothesis  Hi 
to  a  single  table  (instead  of  a  surface  of  tables)  but  this  time  without  estimating 
{n  —  l)(m  —  1)  parameters.  This  method  is  not  very  different  from  the  obvious 
one  given  above  but  does  permit  a  sharper  analysis  (and  tighter  bounds). 
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It  is  easy  to  show  the  above  density  function  is  log-concave  (replace  a;!  with 
r(a;  -f- 1 )  everywhere  and  notice  that  is  non-negative  for  x  > 

[1]  6.4.10).  Thus  we  know  that  if  the  density  will  fall  off  at  least  as  fast  as  some 
exponential  as  distance  from  the  mode  decreases.  It  can  be  shown  that  in  a  wide 
variety  of  circumstances  that  the  mode  is  very  near  the  independent  table^  so 
the  most  important  component  of  significance  is  going  to  be  distance  from  the 
independent  table. 

A  natural  candidate  for  this  metric  is  the  “chi-square”  distance  which  is  defined 


as: 


(y  •  - 

- T_J_ 


X:»X, 


hJ 


(3.3) 


The  of  a  table  Y  is  just  the  square  distance  of  Y  from  independent  where 
each  coordinate  is  rescaled  by  a  factor  of  /T.  A  natural  significance  test  for 

Y  is  then  to  measure  the  proportion  of  the  Fisher- Yates  distribution  corresponding 
to  tables  with  higher  than  Y.  If  very  few  tables  have  x^  higher  than  Y  we 
should  know  that  F  is  a  unlikely  table  under  the  independence  hypothesis  Hi . 


3.2.1  Fisher- Yates  test 

There  are  a  lot  of  asymptotic  methods  for  computing  the  significance  (with  respect 
to  the  Fisher- Yates  distribution)  of  a  contingency  table  [42,  30,  6,  7].  Most  of 
these  techniques  rely  on  the  central  limit  theorem  and  normal  approximations  to 
the  Fisher- Yates  distribution.  These  method  are  often  quite  far  off  for  moderate 
sized  tables. 

Fortunately  there  is  an  obvious  pseudo  polynomial  scheme  for  generating  such 
tables  from  the  Fisher- Yates  distribution.  So  randomized  approximation  schemes 
are  easy  to  implement.  Consider  the  complete  bipartite  graph  with  vertex  sets 
VuV2(\Vi\  =  \V2\^T). 

Vi  — ^  {{i,k)  \  i  =  I  •••  m,  k  =  I  -  Vi} 

V2  =  {(i,/)  I  j  =  l---n,  /  =  l-'-Cj} 

^The  mode  isn’t  always  at  the  independent  table.  Consider  a  tables  with  row  sums 
[1  1  1  1  1  1  10]  and  column  sums  [111111  10].  The  independent  table  has  density 
about  0.006  while  the  table  Y  with  Yij  —  0  for  <  7,  Yi^^  =  \  for  i  <  7,  Yjj  =  1  for  j  <  7  and 
Y;  7  “  4  has  density  about  .026.  These  two  tables  differ  by  more  than  2  in  the  bottom  right  entry. 


55 


Select  a  perfect  matching  M  uniformly  at  random.  Notice  that  there  are  T!  such 
perfect  matchings  (vertices  are  distinguishable,  edges  are  not).  Now  associate 
with  M  the  following  table: 

Xjj  =  ledges  of  the  form  {j,  /))  for  any  k,l\ 


Notice  X  obeys  the  row/column  sums  r,c  and  there  are  exactly 

different  perfect  matchings  that  map  to  this  X.  Thus  uniform 
M  leads  to  Fisher- Yates  distributed  X. 


3.2.2  Uniform  test 

A  major  problem  with  the  Fisher- Yates  distribution  is  that  it  is  very  tightly  concen¬ 
trated  about  its  mode.  This  causes  to  Fisher- Yates  test  to  reject  almost  all  tables 
when  T  is  large.  For  example  the  Fisher- Yates  test  will  reject  all  tables  where 
Xij  differs  from  the  independent  value  X^jXi^^jT  by  more  than  •Jx^jXi^^jT. 
This  guarantees  rejection  if  any  constant  percentage  of  the  classifications  are  in 
error,  no  matter  how  small,  is  present  in  the  data.  Diaconis  and  Efron  [16]  suggest 
a  number  of  techniques  to  deal  with  this  problem  including  a  significance  test 
based  on  the  uniform  distribution.  For  this  test  the  significance  is  the  ratio  of  the 
number  of  contingency  tables  with  greater  than  x^(F)  (and  identical  row  and 
column  sums)  to  the  total  number  of  tables  with  the  given  row  and  column  sums 
=  t|(r,  c).  This  test  has  the  nice  property  that  it  depends  on  the  structure  of  the 
table  Y  much  more  strongly  than  it  depends  on  T.  The  difficulty  is  that  computing 
the  numerator  of  this  ratio  has  been  shown  to  be  (IF  hard  in  general.[24]  Another 
quantity  of  interest  is  the  “likelihood  ratio  statistic”  which  is  the  ratio  of  how  likely 
a  table  is  under  the  Fisher- Yates  distribution  to  how  likely  the  table  is  under  the 
uniform  distribution  (=  P{Y\r,  c)/(l/ji(r,  c))).  This  quantity  is  useful  in  making 
qualitative  statements  about  a  table.  This  can  be  important  because  if  one  decides 
to  reject  the  independence  hypothesis  Hi  one  often  still  needs  to  characterize  the 
table  in  question. 

3.2.3  Some  counting  preliminaries 

We  now  investigate  ji(r,  c)  in  its  own  right.  Good  references  for  this  problem  are 
[18, 16,  20]. 
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— * 

When  we  have  m  =  n  and  r  =  c  =  fcl  the  problem  reduces  the  the  classic 
problem  of  counting  the  number  of  magic  squares.  The  counting  function  turns 
out  to  be  polynomial  of  degree  (m  —  l)(n  —  l)infc.  The  highest  degree  coefficient 
of  this  polynomial  is  the  volume  of  the  transportation  polytope,  a  quantity  of 
independent  interest.  Blakley[l  3]  showed  that  tt(r,  c)  is  a  piecewise  polynomial  (in 
r,  c)  of  degree  (m  —  l)(n  —  1)  and  [15]  reproves  this  result.  Sturmfels  strengthens 
these  results  [51,  52]  and  we  have  reworked  his  proofs  to  create  an  effective 
procedure  for  fixed  m,  n. 

From  [52]  a  generating  function  for  jj(r,  c)  is: 


m  n—l 


nn 


1 

1  -  XiPj 


n— »’m,cr"C„-i6^,Cn=^’^,  Cj-;  r,c>0 


(3.4) 


3.3  Barvinok’s  algorithm. 

Barvinok  [8]  recently  published  a  polynomial  time  algorithm  for  counting  the 
number  of  integer  points  in  any  integral  polytope  in  fixed  dimension.  This  result 
implies  that  counting  the  number  of  contingency  tables  for  fixed  m,  n  is  solvable 
in  polynomial  time.  Barvinok’s  result  reduces  the  counting  problem  to  solving  a 
family  of  sums  over  the  lattice  points  in  a  set  of  cones.  This  counting  scheme,  if 
applied  directly,  would  run  in  time  proportional  to  the  the  index,  with  respect  to 
the  standard  lattice,  of  the  integer  lattice  intersected  with  these  cones.  In  general 
these  indices  (also  called  the  cardinality  of  the  glue  group  in  Conway  and  Sloan) 
can  be  double  exponential  in  the  size  of  presentation. 

For  any  simple  rational  cone  K  in  let  xk  be  the  indicator  function  of  K. 
That  is  xk{x)  —  I  if  x  £  K  and  0  otherwise.  Since  K  is  a  simple  rational 
cone  there  is  a  linear  independent  set  of  integral  vectors  ui,  •  •  •  that  generate 
K.  Ind(/T)  is  then  the  size  of  the  quotient  group  (Z”  fl  Lin(ui,  •  •  •  ,ua;))/  < 
ui , •  •  • , u;:  >.  Barvinok  identifies  a  scheme  which  finds  cones  ^Ks  and 

integers  ei,---,e5  such  that  xa-  =  Ei=i  and  Ind(/fj)  <  Ind(/T)(^“*)/‘^ 
where  d  is  the,  fixed,  dimension  of  space.  Thus  in  only  a  log-log  number  of 
levels  of  recursion  the  problem  can  be  reduced  to  problems  involving  only  cones 
with  constant  index.  This  method  has  not  been  directly  applied  to  contingency 
tables,  though  the  unimodular  nature  of  the  contingency  table  relations  may  admit 
a  number  of  improvements  over  the  general  algorithm. 
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3.4  Piecewise  polynomialality  of  the  number  of  con¬ 
tingency  tables. 

Here  we  give  a  simple  proof  of  the  fact  that  the  number  of  contingency  table  is 
piecewise  polynomial  function  of  row  and  column  sums  where  the  number  of 
pieces  is  at  most  for  n  by  m  contingency  tables.  The  proof  is  not 

specialized  to  contingency  tables,  but  applies  to  any  counting  problem  involving 
totally  imimodular  constraints. 

Take  A,  a  totally  imimodular  dhyw  matrix  of  rank  d  such  that  the  only  solution 
to  Aa;  =  0,  a;  >  0  is  a;  =  0.  For  any  u  €  let  be  the  polytope  defined  by 

Pv  —  {x  €:  R“'  I  a;  >  0  ,  Ax  =  i;} .  (3-5) 

We  call 

(3.6) 

the  counting  function  of  A. 

Theorem  6  If  A  is  as  above  then  the  counting  function  of  A  is  a  piecewise 
polynomial  with  no  more  than  pieces  and  total  degree  equal  tow  —  d. 

This  is  not  a  new  result  (this  was  known  to  [1 5]  and  much  stronger  results  are 
known  by  Sturmfels  [51,  52,  11])  but  the  presentation  given  here  is,  hopefully, 
more  approachable  and  is  useful  in  describing  the  algorithms.  In  addition  to 
proving  the  counting  function  is  piecewise  polynomial  we  demonstrate  a  poly 
time  (for  fixed  dimension)  method  of  computing  the  decomposition  into  pieces 
and  techniques  for  inferring  the  polynomials.  These  methods  can  be  strengthened 
to  deal  with  the  non-unimodular  case  but  run  in  time  proportional  to  the  indices 
of  various  sub  lattices  in  the  standard  lattice.  Before  we  prove  the  theorem  we 
present  some  geometric  lemmas. 

We  wish  to  find  conditions  on  u,  v  so  that  the  operation  of  Minkowski  addition 
(point-wise  adding  two  sets)  and  adding  right  hand  sides  of  linear  relations  are 
identical,  or 

PuAPv^  Pu-vv  (3-7) 

Remark:  2  Clearly  P^pPv  Q  P^^^  but  this  containment  can  be  strict,  example: 


u 


=  {  1  99 

V  =  {99  1 

Pu^-Pv  ^  {(a:,  y)  \x>0,  y>0,  x  +  y  <  ,  x+  y  <  }  while  Pu+v  ~ 

{(®?  y)  \  X  >  0,  y  >  0,  X  +  2y  <  100 1. 

Let  €  R“  be  the  standard  basis  vector  such  that  =  Sij.  Let  x  be  an 
arbitrary  vertex  of  Pu-  At  least  w  —  d  of  the  entries  of  x  must  be  zero  and  there 
is  a  cr  C  { 1 , 2,  •  •  •  u;}  such  that  |(t1  =  d,  re,-  =  0  for  i  ^  cr,  and  if  A'  is  A  with  the 
w  —  d  rows  E\i  ^  a  added  on  then  A'  is  rank  w.  For  matrices  A  let  A„  denote  the 
d  by  d  matrix  formed  by  taking  (in  order)  the  columns  Ai  s.t.  i  €  a,  for  vectors 
X  let  X(r  denote  the  d-vector  corresponding  the  the  entries  whose  indices  are  in  cr. 
In  linear  programming  terms  A^  is  a  basis  and  a  is  the  indices  of  set  of  columns 
corresponding  in  this  basis.  We  see  that  if  Ax  =  u  then  we  must  have  A^Xt^  =  u 
and  I  det(A^)|  =  |  det(A')|.  So  A^  is  full  rank  and  we  see  that  x^  =  A~^u,  so  x  is 
completely  determined  by  a  (or,  again  in  linear  programming  terms,  the  naming 
of  the  basis  determines  a  basic  solution). 

Define  the  sets  of  indices  corresponding  to  basic  solutions: 

B  =  {<7  C  {1,2,  •  •  •  u;}  I  |cr|  =  d,  det(A<^)  ^  0} .  (3.8) 


And  further  define 


R  =  jx  G  R 


G  R, 


a;  is  a  row 


(3.9) 


as  the  “important”  relations  of  the  basic  solutions. 

For  cr  G  R  and  u  G  R'^  let  z/(c7,  u)  be  the  point  in  R’"  such  that  u(a,  u)i  =  0,i  ^ 
a  and  (z^(cr,  u))^  =  A~^u.  We  associate  with  each  u  a  function  Xu  '■  R“'  — ^  0, 1 
defined  such  that 


X«(^) 


1  r  •  u  >  0 
0  otherwise 


(3.10) 


It  should  be  clear  that 

vertices(P„)  =  |t'(cr,  u) 


(7  €  B,  Xu(r)  =  1  for  all  rows  r  of  ^  |. 


Lemma  6  If  Pu  and  Pu  are  pointed  and  bounded  and  Xu  =  Xv  then  P^  A  Pv  = 

P u+v- 
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Proof:  If  Pu  or  Py  is  empty  then  they  both  must  be  empty  and  the  theorem  is  true. 
Because  Pu  and  Py  are  both  pointed  and  bounded  we  know  that  Pu+v  is  pointed 
and  bounded.  Then,  since  we  only  need  to  show  C  it  is  sufficient 

to  show  if  2;  is  a  vertex  of  Pu+v  then  there  exists  x  ^  Pu  and  y  G  Py  such  that 
0  =  X  +  y.  It  is  easy  to  see  that  Xu+v  =  Xu{—  Xv)  because  r-{u  +  v)  =  r-u  +  r-v 
and  sign(r  •  u)  =  sign(r  •  u)  by  assumption.  So  we  find  a  a  ^  B  such  that 
u{a,  u  +  v)  =  z  and  we  know  that  x  =  v{a^  u)  G  P„  and  y  =  v)  G  P„.  □ 

Corollary  1  There  exists  a  set  of  full  dimensional  cones  -  •  •  Cr  inBf'  with 

the  positive  orthant  covered  by  Ci  such  that  ifu,v  G  Cj  then  PuP  Pv  =  Pu-vv 

and r  < 

Proof:  We  see  that  the  linear  relations  in  R  divide  the  space  of  right  hand  sides 
into  pieces  where  Minkowski  addition  of  polytopes  and  vector  addition  of  right 
hand  sides  operate  identically.  [P]  <  |P|  <  d|P|  <  It  is  a  well  known 

fact  that  k  hyperplanes  can  split  into  at  most 


regions.  We  have  k  < 


so  we  will  have  at  most 


d 


E 


full  dimensional  cones.  □ 

We  will  use  S  to  denote  the  set  of  full  dimensional  cones  (often  called  a  “fan” 
when  extended  to  include  all  lower  dimensional  cones).  For  m  by  n  contingency 
tables  we  have  d  =  m  +  n  -  \  (one  of  the  row  totals  is  dependent  on  the  other 
row  and  column  totals)  and  w  =  mn.  We  then  have  the  easy  upper  bound  of 
(mn)(”*+")^  cones.  We  have  completed  the  geometric  preliminaries  and,  after  a 
couple  lemmas  about  polynomial  arithmetic,  are  ready  to  prove  the  main  theorem. 
We  now  demonstrate  that  the  counting  function  is  a  low  degree  polynomial  when 
restricted  to  any  cone  C  G  Z. 

Lemma  7  Supposep:  -X  Q  is  a  rational  polynomial  of  degree  s  inxi,  ■  ■  ■  ,Xk 

M  is  a  d  X  k  rational  matrix  of  rank  d  and  Vx  >  0  integral,  Vy  integral  such 
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that  My  =  0  and  a;  +  y  >  0  we  have  p{x)  =  p{x  +  y)  then  there  is  a  unique 
rational  polynomial  p  of  degree  s  such  that  the  following  diagram  commutes: 


Proof:  First  we  prove  Va;,y  My  =  0  — >•  p{x)  =  p(x  +  y).  We  relax  the 
sign  conditions  on  x  and  x  +  y  and  the  integrality  of  x.  Suppose  y  G  is 
such  that  Va;€^*>0  A  a;  +  y>0  p{x)  =  p{x  +  y)-  Pick  b  such 
b  +  y  >  0  and  define  pi{x)  =  p{x  -f  b)  and  P2{x)  =  p{x  +  b  +  y).  We  see 
that  Va;  G  >  0  pi(a;)  =  piix)  which  is  enough  to  establish  that  they  are 
identical  polynomials.  Thus  we  have  if  y  is  integral  such  that  My  =  0  then 
Va;  p(x)  =  p{x  +  y).  Now  pick  any  y  £  such  that  My  =  0.  Define 
P2{j)  =  p{x  +  jy).  p3  is  a  degree  s  polynomial  inj  and  we  can  easily  find  s  +  1 
values  of  j,  ji,j2  •  "jd+i,  such  that  fy  is  integral.  We  have  p2{ji)  is  constant  at 
these  points  and  therefore  is  equal  to  the  constant  ^(a;)  everywhere.  In  particular 
p{x)  =  p3(0)  =  P3(l)  =  P(a;)- 

It  is  clear  that  there  is  a  unique  function  p ;  Q'^  Q  that  completes  the  diagram, 
so  it  remains  only  to  show  that  p  is  a  rational  polynomial  of  degree  no  more  than 
s.  Pick  lii,  -  , lia  linearly  independent  colunms  of  M  and  let  L  =  , •  •  • , kf) 

then  we  see  Vu  €  p{v)  =  p(L~^v).  □ 

We  now  state,  without  proof,  an  important  result  in  the  geometry  of  numbers 
(see  [34]  pp.  135-140). 

Theorem  7  (McMullen)  If  Pi,P2,  -  •  •  Pk  are  integral  polytopes  in  R®  then  for 
non-negative  A  €  2'*'  then  jj  \iP^  is  a  polynomial  in  Ai ,  A2,  •  •  •  A^  of  total 
degree  no  more  than  s. 

Theorem  ^  If  A  is  totally  unimodular  then  for  each  (7  G  £  there  exists  pc  a 
polynomial  of  degree  no  more  than  w  —  d  in  v\  ■  ■  ■  Vd  such  that  Vu  G  (7  fl  2“^  the 
number  of  integer  points  in  Py  is  equal  to  pc{v). 
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Proof:  (7  is  a  full  dimensional  rational  convex  polyhedral  cone,  so  by  Gordon’s 
Lemma  or  the  Hilbert  Basis  Theorem  [29]  (pp.  1 1  - 1 2)  we  know  that  the  semigroup 
of  integer  vectors  in  C  is  finitely  generated,  say  with  generators  c* ,  •  •  • ,  c*'  G  C 
Because  A  is  totally  unimodular  we  know  from  [34]  (pp.  135-140)  there  is  a 
polynomial  pc  of  degree  w  —  din  variables  xi  •  ■  ■  Xk  such  that  if  u  G  x  G 
and  V  =  (xi  >  0)  then  the  number  of  lattice  points  in  is  equal  to 

pcix).  If  ELi  Xid  =  ELi  2/iC*  then  we  must  have  pc{x)  =  pc{^  and  we  see 
that  we  meet  the  conditions  of  lemma  7.  with  the  polynomial  pc  and  matrix 
M  =  (c\  -  •  •  ,c^).  Thus  there  is  a  polynomial  pc  of  degree  no  more  than  w -din 
variables  v^vj-'-Vd  such  that  pclv)  —  '^{Pv)  for  all  u  G  (7.  □ 

For  contingency  tables  we  have  A  is  the  (m  -|-  n  —  1)  x  (mn)  matrix  where 

A  ■  =  l  ^ 

\  otherwise 

A  is  totally  unimodular  because  its  rows  are  split  into  two  classes  (i  <  m  and 
i  >  m)  such  that  each  column  has  exactly  one  entry  from  each  class  (page  276  of 
[46]  ).  The  above  theorems  imply  that  for  m  and  n  fixed  we  can  pre-compute  the 
complete  fan  E  and  for  each  cone  (7  G  E  pre-compute  the  enumeration  polynomial 
Pc-  This  means  that  for  fixed  m  and  n  that  the  counting  problem  for  contingency 
tables  can  be  solved  in  polynomial  time. 

Direct  formulae  for  2  x  n  and  3  x  n  contingency  tables  were  given  by  Brad 
Mann  [18].  While  these  formula  are  exponentially  large  in  n  it  has  been  noticed 
that  they  have  fewer  terms  than  the  a  dense  polynomial  for  a  cone  would  have 
(Mann’s  3  xn  formula  has  (2”)^  =  4"  terms  where  the  3  x  n  polynomial  could,  if 
dense,  have  (^^”2)  ^  0((6.75)”)  terms).  It  should  be  noticed  that  the  polynomial 
techniques  given  above  can,  without  modification,  handle  contingency  tables  with 
censored  entries  (just  leave  out  columns  of  A). 


3.5  Counting  tables  with  structural  constraints. 

A  table  with  a  structural  constraint  is  a  matrix  X  that  has,  in  addition  to  the 
constraints  mentioned  in  earlier  sections,  constraints  of  the  form  Xjj  =  v,Xij  <  v 
or  Xij  >  V  for  various  i,j.  It  is  easy  to  see  (by  modifying  r*-  and  Cj)  that  it  would 
be  sufficient  to  know  how  to  count  tables  with  constraints  of  the  form  Xij  =  0 
efficiently.  This  is  done  by  eliminating  these  constraints  one  by  one  using  the 
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relationship: 

l}(r;c  :  Xij  =  0,-  ••)  =  |l(r;c  :  ■  ■  -  •  •  .Vi-l,  -  •  ■  rmici,-  ■  •  :  •  •  •) 

(3.11) 

(when  ri,  Cj  >  0).  One  could  also  perform  directly  the  same  cone  decomposition 
that  works  for  tables  without  structural  constraints-  but  this  would  require  more 
storage. 

It  is  clear  that  the  number  m  x  n  table  with  arbitrary  structural  constraints  can 
be  counted  in  about  the  same  amount  of  time  asm  x  n  tables  (if  extra  oracles  are 
produced)  or  in  time  no  more  than  2™"  times  the  time  to  count  one  n  x  m  table 
using  just  one  counting  oracle. 


3.6  Assigning  a  manageable  order  to  tables. 


If  instead  of  counting  one  wishes  to  explicitly  enumerate  contingency  tables  it  is 
useful  to  define  a  total  order  on  tables.  This  order  is  also  very  useful  in  constructing 
a  contingency  table  sampling  scheme  from  a  counting  oracle. 

Let  r(ri  ,•••  r„;  Cl  ,•••  Cn)  =  r(r;  c)  be  the  set  of  m  by  n  non-negative  integer 
matrices  X  such  that  Vi  YTj=\  and  Vj  =  Cj.  We  assign  a 

linear  order  to  this  set.  The  order  we  have  been  working  with  is  the  “lex”  order 
defined  VX,y  €  T(r;c) 


(X  >  y) 


/  3a, 61<a<m,  l<6<n 


V 


A 

A 


Xa,b 

Vi  >  b  Xa,^ 
Vi  >  aVi  Xi^j 


>  K.6\ 

=  ) 

(3.12) 


With  some  care  we  can  step  through  all  the  tables  in  T{r;  c)  efficiently.  We 
appeal  to  a  routine  called  “backfill”.  When  X  is  a  m  by  n  matrix  and  a,  b  are 
integers  backf  ill(X,  a,  6,  r,  c)  returns  the  lex  least  non-negative  integral  table 
y  obeying  margins  r,  c  such  that  y-j  =  Xij  for  all  i  >  a  and  if  1  <  a  <  m  then 
Ya,j  =  Xa,j  for  all  j  >  b.  If  there  are  no  such  tables  backfill  returns  ±. 

backfill  works  the  following  simple  observation.  Let  T(r;c  :  Xm,jt  = 
Vi,“-Xm,jk  =  ^k)  be  the  set  of  tables  X  e  T{r;c)  such  that  Xm,n  = 
Ul,  •  •  •  XraJk  ~  '^k 


Lemma  8  T(r;c  :  =  Vfc)  =  0  iffT{r;c  :  Xmji  - 

ui,  •  •  •  X^j^  =  Vk,Xm,j^^,  =  Ufc+i)  =  0  where  all  the  ji  (i  =  1  •  •  •  A;  +  1) 
are  distinct  and  Vk+\  =  max(0,  min(  1  ’  j= 1  ) )  • 
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Proof:  Clearly  there  is  only  one  direction  to  prove  (as  adding  a  constraint  can  not 
make  an  infeasible  system  feasible).  So  assume  T(r;  c  :  Xmji  =  ui,  •  • '  Xm,jk  — 
Ufc)  ^  0  and  let  X  eT{r,  c;  j,  =  ,  •  •  •  =  Vk)  be  a  table  with  maximal 

It  is  clear  that  <  Vk+i.  IfX^j,^,  <  Vk+i  then  there  are  a,  6  such 

that  Xa,jk+i  >  0  and  Xm,b  >  0  and  b  distinct  from  all  j  and  we  could  build  a  new 
table  by  increasing  Xm,jk+\  ^  decreasing  Xa,^^^.,  and  Xm,b  hy  1. 

Thus,  since  Xm,jk+\  was  maximal,  have  a  feasible  table  with  Xmjk+i  ~  ‘^k+i- 

So  backfill  can  operate  by  filling  in  the  highest  index  row  first  (the  most 
significant  entries  fall  in  this  row)  and  trying  to  put  as  much  mass  as  possible  in 
the  least  significant  (left  most)  positions  in  the  row  (using  lemma  8).  If  no  sums 
are  violated  then  the  table  created  is  the  lex  least  matching  the  specified  partial 
table,  otherwise  there  are  no  such  tables  (and  backf  ill  should  return  -L). 

One  can  efficiently  step  through  all  the  tables  in  T(r;c)  by  calling 
backfill(0,m  +  1,0, r,c)  to  get  the  lex-first  table  Yi.  To  proceed  to  the 
next  table  one  sets  (a,  b)  to  the  least  significant  entry  (1,1),  sets  Y  to  be  the  current 
table  with  the  (a,  6)th  entry  incremented  by  1  and  calls  backf  ill(y,  a,  b,  r,  c) 
and  determines  if  the  returned  table  is  feasible.  If  the  table  is  infeasible  (a,  b) 
is  advances  to  the  next  significant  entry,  which  is  increased  by  1,  and  the  call  is 
repeated.  The  first  legal  table  returned  is  the  next  table  in  our  lexicographic  order. 
This  method  imitates  the  propagation  of  carries  found  in  counting,  so  even  though 
a  simple  analysis  indicates  that  it  could  take  nm‘^{T{r;  c))  fill  m  attempts  to  run 
through  all  KT(r;c))  tables  it  is  easy  to  see  (for  tables  where  r  and  c  have  all 
largish  entries)  this  method  actually  lists  all  the  tables  using  only  0(S(r(r;  c))) 
calls  to  backfill. 


3.7  Divide  and  conquer. 

Let  Xi  denote  the  ith  row  of  a  matrix  X.  The  basic  method  we  are^using  to 
count  m  by  n  tables  (m  >  n)  is  as  follows:  let  k  -  [m/2j  and  qi  =  E;=i  and 
qi  —  E£fc+i  ’"i  we  then  notice  that 

ti(T(ri,--T,n;ci,---c„))  =  Y.  lt(T(ri,---rfc;Xi))x^(r(rfe+i,---r,n  X  )). 

X£T(q\^q2\c\^'"Cn) 

The  idea  is  simple:  we  can  split  each  table  into  two  independent  set  of  rows 
if  we  know  how  much  each  column  sums  to  in  each  set  of  rows.  Thus  if  we  s\im 
over  all  ways  the  columns  can  split  their  totals  between  these  sets  (the  set  of  such 


64 


splits  is  equivalent  to  a  set  of  2  x  n  tables)  then  we  can  multiply  the  number  of 
ways  to  fill  in  the  top  and  bottom  portions-  allowing  us  to  count  much  faster  than 
explicit  enumeration. 

This  formula  can  be  applied  recursively  and  we  can  stop  the  recursion  at  the 
base  cases  where  m  or  n  is  equal  to  one  and  the  answer  is  one.  This  method  is 
much  faster  than  explicitly  enumerating  all  of  the  tables  (in  fact  it  was  able  to 
calculate  the  number  of  tables  consistent  with  Snee’s  hair  and  eye  color  margins  in 
under  20  minutes  on  an  HP720).  This  method  is  not  polynomial  even  for  fixed  m 
and  n.  However  it  is  sufficient  to  solve  counting  problems  to  infer  the  piecewise 
polynomial  that  is  the  true  counting  function. 

Remark:  3  Actually  exhaustive  enumeration  would  be  sufficient  to  imply  a  poly¬ 
nomial  time  counting  method for  fixed  m,  n  as  the  tables  that  need  to  be  solved  to 
infer  the  piecewise  polynomial  depend  only  on  m  and  n  and  not  on  the  margins 
of  any  particular  table  one  wishes  to  know  the  count  of  This  would,  of  course, 
be  prohibitive  in  practice  so  it  is  desirable  to  develop  the  divide  and  conquer 
technique. 


3.8  Inferring  the  piecewise  polynomial. 

At  this  point  we  have  an  effective  scheme  for  identifying  the  regions  that  the 
counting  function  is  a  well  behaved  on.  All  the  remains  is  to  infer  the  polynomial 
for  each  piece.  There  are  some  results  in  commutative  algebra  that  relate  the 
polynomials  to  “Hilbert  Series”  and  “Todd  Classes”,  but  these  structures  encode  a 
lot  of  information  and  are  in  themselves  often  hard  to  compute.  The  strategy  taken 
here  is  to  assume  access  to  a  counting  oracle  (in  this  case  simulated  by  the  divide 
and  conquer  counter)  and  then  recover  the  desired  polynomial  by  interpolating  the 
values  known  in  a  region. 


3.8.1  Lagrange  interpolation 

The  first  interpolation  method  is  not  too  sophisticated.  We  find  a  point  b  in  the 
cone  we  are  working  with  such  that  b  +  sE^  (s  is  the  degree  of  the  polynomial) 
is  in  the  cone  for  all  i.  We  know  such  a  point  must  exist  because  the  cone  we  are 
using  is  pointed  and  full  dimensional  (and  therefore  contains  arbitrarily  big  balls 
as  we  move  away  from  the  origin).  We  then  count  (by  divide  and  conquer)  all 
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the  tables  corresponding  to  any  set  of  margins  found  in  the  set  of  points  x  € 

{d  =  r  +  c— l,w  =  rc,s  =  (r  —  l)(c— 1)) suchthata;  >  6andl-(a;  — 6)  <  s.  The 
counts  at  these  points  are  used  to  determine  the  polynomial  by  using  “Lagrange 
interpolation.”  For  6,  a;  G  Z'^  and  s  ^  Z  such  that  x  >  b  and  1  •  (a;  —  6)  <  s 
the  Lagrange  polynomial  lb,x,s  is  the  unique  degree  s  polynomial  in  variables 
vi---vd  such  that  lb,x,s{^)  =  1  h,x,s{y)  =  ^  for  any  integral  y  ^  x  such  that 
y  ^  Z"^,  y  >b,  I  ■  {y  —  b)  <  s.  lb,x,s  has  a  very  simple  definition 


n;=,  (i  +  l't-f-f) 

(vi-hi)  1 

{xi-bi)''b+E\x,s-\ 


X  =  b 

Xi  >  bi^  Xj  =  bj  Vj  <  i 


(3.13) 


It  is  then  easy  to  see  that 


Pc  =  XI  F{x)lb,x,s 


(3.14) 


where  F( a;)  is  the  number  of  integer  points  in  P^. 

This  technique  is  important  because  we  do  not  have  to  explicitly  store  a  linear 
transformation  to  fit  the  polynomial  coefficients  to  the  known  evaluations  (in  the 
4x4  case  there  are  =  11440  coefficients  to  fit  which  would  require  a 
1 1440  X  1 1440  matrix  if  the  linear  transformation  were  stored  explicitly.  If  each 
entry  took  just  16  bytes  to  store  this  would  still  represent  almost  2  gigabytes  of 
storage).  Also  if  extra  storage  is  available  this  method  can  change  k  polynomial 
values  into  k  coefficients  in  time  0(^k^).  This  can  be  accomplished  by  building 
a  table  of  all  k  Lagrange  polynomials  in  a  sort  of  Gray-code  order  (where  most 
polynomials  in  the  table  can  determined  by  dividing  a  known  one  by  a  binomial 
and  multiplying  by  another  binomial  in  0{k)  time).  Even  though  this  requires  as 
much  storage  as  the  direct  linear  algebra  approach  it  is  much  faster.  The  primary 
weakness  of  this  method  is  that  b  may  be  very  large  and  require  the  solution  of 
very  difficult  counting  problems  (though  the  set  of  problems  requiring  solution  is 
very  structured,  allowing  some  savings). 

A  nice  generalization,  which  would  help  with  “large  6”  problem  mentioned 
above,  would  be  to  find  how  to  easily  compute  Lagrange  polynomials  for  an 
arbitrary  set  S  such  that  S  determines  pc  and  S  has  minimal  cardinality.  The 
polynomials  would  then  be  indexed  by  x,S  and  d  where  lx,s,s  would  be  the  unique 
degree  s  polynomial  such  that  lx,s,s{^)  —  1  and  lx,s,s  is  zero  on  S  \  x.  By  linear 
algebra  we  know  that  such  a  basis  for  the  space  of  all  degree  s  (or  less)  polynomials 


66 


exists-  but  it  is  not  clear  that  it  can  be  competed  with  limited  space  and  time  (i.e. 
with  less  space  than  the  number  of  terms  squared). 

One  conjectured  method  to  efficiently  perform  this  calculation  would  be  to 
hope  that  all  minimal  S  have  structure  similar  to  the  simplex  we  used  in  the  last 
section.  The  recursive  formula  given  for  the  Lagrange  polynomials  depended  on 
finding  a  set  of  degree  1  polynomials  ((u;  -h),  i=  ••■d  and  {I -  v-s-l-b)) 
that  each  zeroed  out  at  least  points  in  S  but  have  no  common  zero  in  S. 

Even  for  the  simple  case  of  d  =  2,  s  =  2  this  is  not  always  possible.  Take 


5  =  {(0,0),  (1,0), (0,1), (1,1),  (3, 2),  (2, 3)}.  (3.15) 


We  expand  elements  of  S  to  vectors  in  R,^  by  taking  [x,  y)  (1 ,  a;,  y,  xy,  re  ,  y  ) 

and  write  down  the  6  x  6  matrix  gotten  by  expanding  all  of  S: 


M  = 


'  1 

1 

1 

1 

1 

1 


0  0  0  0  0' 

10  0  10 
0  10  0  1 

11111 
3  2  6  9  4 

2  3  6  4  9, 


(3.16) 


We  see  that  det(M)  =  327^0so5'isa  minimal  basis  of  the  kind  we  wanted.  But 
_  3  points  in  S  are  collinear-  so  there  is  no  way  to  build  the 


no  "  j  =  ~  ^  pomis  m  o  are  eumnecu-  su  uicic  la  nu 

Lagrange  polynomials  up  from  degree  1  polynomials  as  before. 


3.8.2  An  improvement 

A  simple  improvement  is  to  notice  that  if  M  :  ^  R”^  is  a  full  rank  linear 

transformation  and  p  :  R*^  — >•  R  is  a  rational  polynomial  of  degree  no  more  than  s 
then  ^defined  such  that  p{x)  =  p{M~^x)  is  also  a  rational  polynomial  of  degree 
no  more  than  s.  For  each  cone  (7  G  S  we  let  Me  be  the  d  x  d  matrix  whose 
columns  are  the  first  d  integral  vectors  in  C  such  that  M  is  full  rank  (vectors 
ordered  in  graded  lex  order).  Then  we  see  if  S  =  ^x  ^  Z'^  |x>0,l'a:<s| 
then  MciS)  is  a  set  of  points  that  determine  the  interpolation  polynomial  for  the 
cone  C  and  p  can  be  inferred  by  the  above  naive  method.  The  advantage  is  that 
this  set  of  problems  may  be  much  smaller  (and  therefore  easier)  that  the  set  of 
problems  that  the  original  interpolation  method  would  require.  Finding  a  basis  for 
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7i^  in  the  given  cone  is  of  course  a  hard  problem,  but  it  is  a  problem  in  7/^  not 
in  the  much  larger  space  of  coefficients  and  evaluations.  In  practice  d  is  so  much 
smaller  than  the  number  of  coefficients  needed  that  the  time  needed  to  search  for 
the  small  basis  by  brute  force  has  been  negligible  compared  to  any  of  the  other 
steps  in  the  algorithm.  It  would  be  interesting  to  use  more  subtle  techniques  such 
as  computing  a  Hilbert  basis  for  the  cone  and  examining  it  for  an  acceptable  short 
basis. 


3.9  Sampling  from  counting. 

For  many  statistical  tests  it  is  important  to  be  able  to  generate  a  table  uniformly 
at  random  that  obeys  given  row  and  column  sums.  It  is  well  know  [38]  that 
sampling  and  counting  are  intimately  related.  For  this  problem  the  relationship 
can  be  exposed  by  combining  the  fixed  dimensional  counter  and  the  lex  ordering 
introduced  here  for  brute  force  enumeration.  The  method  is  as  follows. 

Suppose  we  want  to  generate  a  table,  X,  uniformly  at  random  obeying  rows 
sums  Tj ,  r2,  •  *  •  f'm  column  sums  ci ,  C2,  •  •  ■  where  m,  n  are  fixed.  We  will 
assume  that  we  have  pre-computed  a  representation  of  the  piecewise  polynomial 
counting  function,  so  we  can  count  tables  at  will.  First  we  compute  N,  the  total 
number  of  tables  obeying  r,  c.  We  generate  a  integer  fc  in  the  set  {1 , 2,  •  •  •  iV} 
uniformly  at  random.  We  are  going  to  return  the  lex  kih  table  as  our  uniform 
random  sample.  First  we  look  at  the  most  significant  entry  of  the  tablei  Xjn,n- 
If  for  an  integer  b  such  that  more  than  k  tables  obeyed  r,  c  and  the  structural 
constraint  Xm,n  <  b  then  we  would  know  that  more  than  the  lex  kih  table  must 
have  Xm,n  <  b.  Similarly  we  know  that  if  fewer  that  k  tables  obeyed  r,  c  and 
Xm,n  <  b  then  the  lex  kih  table  must  have  X^.n  >  b.  Thus  we  can,  using  binary 
search  on  b,  find  in  time  log2(min(r,„,  c„))  find  the  true  value  of  Xm.,n  for  the  lex 
k'ih  table.  We  then  add  Xm,n  =  6  as  a  structural  constraint  and  use  the  same 
technique  to  find  the  correct  value  for  the  next  most  significant  entry.  This  process 
is  repeated  until  the  entire  table  is  constrained  and  then  X  is  the  lex  A:th  table  as 
desired.  This  method  can  generate  a  true  uniform  sample  by  solving  no  more  than 
(m  -  l)(n  -  1)  log2(min(rm,c„))  structural  counting  problems.  We  can  either 
assume  that  we  had  the  (m  - 1 )  (n  - 1 )  different  types  of  counting  polynomials  pre¬ 
computed,  or  by  using  the  formula  3.1 1  judiciously  and  converting  each  structural 
problem  into  no  more  that  counting  problems  (notice  that  we  apply  structural 
constraints  to  only  one  row  at  a  time  until  the  row  is  completely  constrained). 
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3.10  4x4  results. 


We  have  solved  the  m  =  n  =  4  case.  This  case  splits  into  (after  some  symmetries 
are  removed)  3694  cones  such  that  U,4  restricts  to  a  degree  9  polynomial  in  7 
variables  in  each  cone.  This  means  each  polynomial  is  determined  by  11440 
coefficients  so  can  be  interpolated  from  11440  sufficiently  general  evaluations. 
Each  of  the  3694  tasks  seems  to  represent  about  3  hours  of  pmax  cpu  time,  so  the 
total  job  represents  about  1.5  pmax  cpu  years.  This  task  was  completed  in  just 
over  6  weeks  using  Peter  Stout’s  WAX  system  [50]  which  effectively  simulates  a 
coarse-grain  parallel  supercomputer  (employing  the  numerous  idle  workstations 

at  CMU). 

This  decomposition  is  able  to  count  the  number  of  tables  compatible  with  the 
hair/eye  color  table  1.1  in  30  seconds.^ 

3.10.1  Snee’s  table 

This  table  1.1  has  been  our  main  test  case.  It  can  be  counted  in  under  20  minutes 
using  the  divide  and  conquer  technique.  The  divide  and  conquer  counter  has  been 
augmented  to  generate  a  batch  of  uniform  random  samples  while  counting.  This  is 
accomplished  by  generating  a  large  set  uniform  random  integers  from  the  range  1 
to  the  total  number  of  tables  compatible  with  the  margins.  The  divide  and  conquer 
procedure  is  then  run  with  the  batch  being  divided  into  the  appropriate  divisions 
until  the  tables  are  completely  filled  out.  The  indices  used  here  are  not  in  the  lex 
order  used  in  other  sections  but  in  an  arbitrary,  but  constant,  order  determined  by 
the  coding  of  the  divide  and  conquer  algorithm.  This  method  when  dealing  with 
batches  of  10000  tables  can  generate  about  20  tables  a  second.  The  pre-computed 
4x4  solution  can  count  Snee’s  tables  in  about  30  seconds.  And  this  method  can 
generate  a  uniform  random  sample  in  a  couple  of  minutes.  The  generator  can  be 
converted  into  a  batch  process  (so  many  tables  share  the  first  few  hard  counting 
problems)  to  generate  uniform  random  tables  in  a  batch.  The  savings  would  come 
from  many  tables  sharing  the  harder  early  counting  problems. 

The  results  to  date  on  this  table,  which  clearly  has  been  over  studied,  are  that 
best  estimate  of  the  chi-square  significance  of  the  table  is  15.5%  with  a  3  standard 
deviation  confidence  interval  of  ±.15. 

This  agrees  well  with  results  from  two  Markov  chains  (described  later)  one 

^  All  CPU  times  given  in  this  section  are  HP720  CPU  seconds  unless  otherwise  noted. 
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based  on  a  natural  “King’s  move”  walk  and  one  based  on  the  newer  “ball  move” 
walk.  These  chains  returned  estimates  of  15.6  (±3  std  dev  interval  15.2  •  •  •  15.9) 
and  15.6  (±3  std  dev  interval  14.1---17.1)  respectively.  The  second  confidence 
interval  is  much  wider  as  the  chain  ran  for  a  much  shorter  time. 

The  confidence  intervals  would  not  have  been  possible  without  the  proof  of 
convergence.  One  could  design  and  run  one  of  these  chains  Avithout  the  conver¬ 
gence  proof.  However  for  a  given  length  run  one  would  not  be  able  to  estimate  how 
many  truly  random  samples  the  set  of  samples  drawn  for  the  chain  was  equivalent 
to.  So  there  would  be  no  way  to  draw  confidence  intervals  around  the  estimate. 
Thus  one  could  be  in  the  unhappy  circumstance  of  having  two  chains  in  wild 
disagreement  and  be  able  to  draw  any  conclusions. 


3.11  5x4. 

The  5x4  problem  is  large  enough  to  be  considered  impractical.  However  for  any 
one  table  one  can  identify  a  cone  its  right  hand  side  lies  in  and  infer  the  polynomial 
for  this  cone  alone.  In  fact  we  have  found  significant  savings  in  not  inferring  the 
true  polynomial  by  computing  /(^)4  (where  f(x)  is  the  true  value  and  is  the 
Lagrange  polynomial  such  that  lx{x)  =  1  and  lx{y)  =  0  for  all  other  points  in  the 
interpolations  set)  but  computing  the  count  for  the  desired  table,  2;,  by  computing 
/(^)  =  f{x)lx{z).  lx{z)  is  an  integer  (not  a  polynomial)  and  can  be  computed 
much  quicker  than  can  and  the  process  requires  almost  no  storage. 

This  method  was  used  to  determine  the  number  of  tables  compatible  with  the 
margins  [  182,  778,  3635,  9558,  11110  ]  and  [  3046,  5173,  6116,  10928  ]  is 
23196436596128897574829611531938753  in  8  days  on  an  HP720  workstation. 
While  this  may  not  seem  quick  it  should  be  pointed  out  that  computing  the  sum 
mentioned  in  the  previously  paragraph  is  “embarrassingly  parallel”  (most  of  the 
time  would  be  spent  in  computing  the  terms  of  the  sum  without  any  need  for 
communication)  so  this  calculation  could  he  done  quickly  on  a  parallel  computer. 


3.12  m  X  2  and  m  x  3. 

Brad  Mann  has  developed  effective  formulas  for  both  the  m  x  2  and  m  x  3  case. 
Mann’s  results  are  inclusion/exclusion  formulas  over  all  partitions  of  m.  These 
formulas  have  a  number  of  terms  comparable  to  the  polynomials  developed  here  for 
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exact  computation  but  have  the  distinct  advantage  of  involving  no  pre-computation 
and  use  very  little  space. 


3.13  Asymptotic  performance  of  estimates. 


Asymptotic  behavior  of  the  contingency  table  counting  function  can  be  easily  read 
off  our  chamber  decomposition.  This  has  made  possible  an  interesting  comparison 
of  several  estimation  schemes  in  the  literature. 

The  first  estimate  is  from  [44]  and  is  intended  for  sparse  tables.  Let  T  = 
ri  =  TTj=\  Cj  and  assume  the  row  sums  are  all  less  than  log(min(m,  n 

then 

«(r;c)  -  E\{r,c)  =  (3.17) 

as  m,n  ^  oo.  When  the  sparsity  condition  is  not  met  this  estimate  tends  to  be  off 
by  an  exponential  factor  (and  is  not  used  for  the  type  of  contingency  tables  arising 
in  statistics). 

The  next  estimate  is  from  [30]  and  is  intended  for  non-sparse  tables  (with  a 
large  number  of  large  rows).  The  idea  is  to  take  the  number  of  tables  obeying 
the  row-  constraints  and  to  multiply  it  by  an  estimate  of  the  odds  of  satisfying  the 
column  constraints. 


a 


2 


Q  = 

tt(r;c) 


{n  -  \)T,ri{ri  +  n) 
(n  -|-  l)n2 


n  —  I 


0  /  J 

(T^n 


Y.c]-T^ln 


E2(r;  c)  = 


/^-lN(n-l)/2  m  /  .-^^_1\ 

Xlircr^n)  M  V  n  —  \  J 


(3.18) 


And  the  final  estimate  is  from  Diaconis  and  Efron  [16].  This  estimate  is  based 
on  a  volume  times  density  of  lattice  argument-  with  the  novel  innovation  that 
the  volume  is  purposefully  overestimated  to  compensate  for  some  comer  effects 
missed  in  this  type  of  analysis.  Because  of  this  the  Diaconis/Efron  technique 
estimate  is  much  harder  to  analyze  and  not  much  is  know  about  it. 


1 

1  -|-  mn/2r 
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Vi  = 


Ci  = 


k  = 


I  —  W  WVi 
m 

1—2  wci 


n  T 
n  +  1  1 


nEr=in"  « 


tt(r;c)  ~  E3{r\c)  = 

(m  \  ”-i 

S") 


+ 

-  X 

2  ; 


fc-i 


u=l 


r(nA:) 


r(n)”^r(fc)^ 


(3.19) 


The  first  asymptotic  trajectory  we  examine  is  coxmting  the  number  of  tables 
with  row  sums  [t,kt,  kt]  and  column  sums  [t,kt,  kt]  where  fc  is  a  positive  integer. 
By  examining  the  complete  3  by  3  solution  it  is  clear  that  the  number  of  tables 
obeying  these  margins  is 


k 

3 


which  is  asymptoticly  as  ^  oo  and  k  is  held  constant.  A  little  work 

shows  the  asymptotic  behavior  of  the  estimates  on  this  trajectory  is  as  follows. 


iog(£;i) 

E2 

E3 


(1  +  Ik'^Y  2 
2{\  +  2ky 

3V3fc2 


47r(2F+l)e4(fc-i)V(2^Hi) 
2  ^ 


kH^ 


(3.20) 

(3.21) 


(2+4  fc)^ 


8(i+2fc)2r(5+^g^)' 


Ki .2 fer 

(k 2 •p/3+16fe+14fc^  x 

l+2fc^  ^fc^/3^4  p_22) 


Estimate  £^1  is  not  really  worth  discussing  (it  was  never  intended  for  this  case). 
Estimate  E2  tends  to  overestimate  the  count  by  a  factor  of  about  k  and  E3  tends  to 
underestimate  by  a  factor  of  .  Figure  3 . 1  plots  the  behavior  of  these  asymptotic 
expressions  as  a  function  of  k.  The  graph  is  the  value  of  equations  3.21  and  3.22 
and  the  true  asymptotic  behavior  (ast  ^  oo),  all  divided  by  as  functions  of  k. 
Figure  3.2  shows  the  same  trend  with  the  true  count  rescaled  to  be  1. 
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solid=true,  dashed=E2,  dotted=E3 

Figure  3.2:  Rescaled  asymptotic  behavior  of  counting  estimates. 


Another  interesting  trajectory  is  Along  this  trajectory 

the  number  of  contingency  tables  is  asymptoticly  t^/3,  E2  is  asymptoticly 
3\/3/(47re'’)t^  and  E3  is  asymptoticly  t^/4. 


3.14  Large  tables  with  small  margins. 


The  divide  and  conquer  counter  (when  combined  with  Mann’s  formulas  as  base 
cases)  is  able  to  exactly  tables  from  the  literature. 

For  example,  Mehta  and  Patel  [42],  in  the  course  of  performing  significance 
tests,  estimate  counts  for  the  following  tables  (using  Gail/Mantel’s  method  [30]). 


Problem  1 : 


Problem  2: 


Problem  3 : 


Problem  4: 


Total 


Total 


Total 


Total 


Total 


1 

1 

1 

0 

0 

0 

1 

3 

3 

10 

4 

4 

4 

4 

4 

4 

4 

1 

1 

30 

5 

5 

5 

4 

4 

4 

5 

4 

4 

40 

Total 


2 

0 

1 

2 

6 

11 

1 

3 

1 

1 

1 

7 

1 

0 

3 

1 

0 

5 

1 

2 

1 

2 

0 

6 

5 

5 

6 

6 

7 

29 

Total 

2 

0 

1 

2 

6 

5 

16 

1 

3 

1 

1 

1 

2 

9 

1 

0 

3 

1 

0 

0 

5 

1 

2 

1 

2 

0 

0 

6 

5 

5 

6 

6 

7 

7 

36 

Total 

1 

1 

1 

0 

0 

0 

1 

2 

4 

10 

4 

4 

4 

5 

5 

5 

6 

5 

0 

38 

1 

1 

1 

0 

0 

0 

1 

2 

4 

10 

6 

6 

6 

5 

5 

5 

8 

9 

8 

58 
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Problem  5: 


Total 


Total 

Total 

Problem  6: 

Total 


Our  results  working  with  these  tables  are  as  follows: 


Problem 

True  count 

reported  est 

Gail/Mantel  ests 

Diaconis/Efron  ests 

1 

35353 

40500 

34534,73397 

37992,40169 

2 

3187528 

1.1  *  10^ 

3.01  *10S  3.84  *10® 

3.30*  10®,3.32*  10® 

3 

97080796 

68  *  10^ 

125  *  10®,68  *  10® 

110*  10®,  112*  10® 

4 

1326849651 

624*  10^ 

1963  *  10®,  519*  10® 

1615*  10® ,  1757*  10® 

5 

2159651513 

1.6*  10^ 

2.5*  IOM.7*  10^ 

2.6*  10^2.6*  10® 

6 

108712356901 

64  *10^ 

132  *  10^  64  *  10^ 

144*109, 149,^109 

Table  3.1 :  Estimates  from  the  literature. 


Both  the  Gail/Mantel  and  Diaconis/Efron  estimate  are  asymmetric  (can  give 
different  estimated  counts  for  a  table  and  its  transpose),  so  we  have  reported  both 
estimates  (with  the  better  one  first).  Somebody  actually  using  these  estimates 
would  not  be  able  to,  in  all  cases,  identify  the  best  of  the  two  estimates.  The 
differences  between  Mehta/Patel’s  reported  values  of  the  Gail/Mantel  estimate 
and  values  given  in  lines  1,2  and  4  of  this  table  are  disturbing  (the  others  are 
insignificant). 
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3.15  Magic  squares 


As  an  exercise  we  were  able  to  rederive  all  the  known  magic  square  Ehrhart 
polynomials.  Pn{x)  is  defined  as  the  number  of  n  x  n  non-negative  integer 
matrices  such  that  every  row  and  every  column  adds  up  to  a;.  p\  -  --pe  are: 


Pi  (a;) 
P2{x) 

P3{x) 

p^{x) 


P5{x) 


P6{x) 


1 

I  +  a; 

,  9  X  \Sx^  3  x^  x'^ 

,  65  a:  379  a;^  351 17  a:^  43  a:^  1109a;5  2x^  19  a:’ 

18  63  5670  10  540  3  135 

II  a:*  Ux^ 

630  '*'11340 

,  ,  725  a;  ,  6229735  a;^  ^  3028287247  a;^  438177965089  a;^  66411 8435  a:^ 

144  +  494208  ^  145297152  17435658240  ^  28740096  ^ 

3812839477  a;^  196563587  a;’  3541860299  a;*  55426325  a;^  125188639 

229920768  ’*'  20901888  836075520  ^  36578304  292626432 

984101a:"  72750523  a:"  112655  a:"  1008757  a:"  188723  a:" 

10450944  '*’  4598415360  ^  57480192  5977939968  20922789888 

188723  a:" 

836911595520 

3899  a:  46584105377  a:’  12246206617138789  a:’  382955230861099213 

600  2141691552  ^  247365374256000  45171 06834240000 

155498465793777230567  a:’  14226886368398551  a:^  243245 11 16263 17349  x 

1355132050272000000  112634352230400  2111894104320000 

232132948167689  X*  ,  25357819401 1961479  x^  736591080322991  x" 
2634721689600  4446092851200000  ^  23433524674560  ^ 

16265048187290869  x"  2000221303490489  x"  570713692223620411  x" 
1098446469120000  334764638208000  276180826521600000 

8346012436199  x"  1424745952102609  x"  77984295979769  x" 
13638559334400  ^  9206027550720000  '*'  2343352467456000 
106234847821 1833  x"  1 8674864899  x"  2462417656967  x" 
175751435059200000  ^  20324995891200  21341245685760000  ^ 
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141248912237  853529939221x21  4394656999x^2 

12014330904576000  901074817843200000  75690284698828800 

158824242127x22  9700106723x2^ 

62444484876533760000  ^  136783157348597760000 
9700106723  x^^ 

10258736801 144832000000 

3.16  Availability  of  software. 

The  url  http :  /  /mixing .  sp .  cs .  emu .  edu/  provides  an  online  demonstra¬ 
tion  of  this  work  over  the  World  Wide  Web. 


3.17  Problem  hardness. 

The  problem  of  determining  the  number  of  contingency  tables  consistent  with  row 
and  column  totals  is  quite  difficult.  Martin  Dyer  and  Ravi  Kannan[24]  established 
the  following. 

Theorem  9  Determining  the  number  of2  xn  non-negative  integer  matrices  X 
such  that  Xi^j  =  (i  =  1,2/  and  Xij  +  X2,j  =  Cj  (j  =  I  •  n)  is  ttT*- 
complete. 

Proof:  In  [22]  it  is  shown  that  given  the  positive  integers  ai,  a2,  •  •  • ,  o-n-i ,  6  it  is 
j|P-hard  to  compute  the  n  -  1  dimensional  volume  of  the  polytope  defined  by 

Ej=l  ajVj  <  b 

0<yj  <l  (i  =  1,2,  •••n-  1)  ■ 

Thus  if  =  6  it  is  t|P-hard  to  compute  the  n  -  1  dimensional  volume 

E”=i  djVj  =  b 

0<yj  <l  (i  =  1,2, •••n)  ■ 

Substituting  xij  =  Ujt/j,  X2,j  =  aj{\  -  yj)  we  encode  this  as  a  small  set  of 
contingency  table  problems.  We  identify  Cj  =  aj  {j  =  l,---n),  ,ri  =  b 
and  r2  =  Ej=i  «i-  We  then  use  the  fact  [34]  that  for  integer  k  the  function 
/(fc)  =  tt(fcr,  fee)  is  a  degree  n  —  1  polynomial  in  k  and  the  coefficient  of  is 


(3.28) 
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the  desired  volume.  Thus  if  we  could  solve  the  contingency  table  counting  problem 
for  =  0, 1 , 2  •  •  •  n  —  1  we  could  infer  all  of  the  coefficients  of  /,  including  the 
highest  one.  It  is  easy  to  see  that  the  interpolation  can  be  performed  using  rational 
numbers  that  require  no  more  than  0(n(log(ri  +  r2)  +  n))  bits  to  represent.  □ 


3.18  Near  uniform  generation. 


Consider  m  >  1  row  by  n  >  1  column  contingency  tables  obeying  row  and 
column  sums  (rj,  Cj  >  0  and  integer)  such  that  (^  =  1  •  •  •  m  —  1)  and 

Cj+i  >Cj  (j=  l---n-l). 

Letr  =  Er=in  =  E”=iC,-, 


K  = 


(1) 

1 

Xi,j 

< 

Cj 

3 

=  1- 

•  •  n 

(2) 

Xij 

< 

Vi 

i 

=  1  •• 

•  m 

(3) 

—  l  v-^n— 1 
Lji-i  2^j=i 

Xij 

> 

T  V  m  Cfj 

(4) 

> 

0 

3 

=  1  • 

•  •  n 

i 

=  !•• 

•  m 

There  is  an  obvious  1  —  1  correspondence  between  integral  points  in  K  and 
contingency  tables  obeying  the  given  row  and  column  sums  (fill  in  last/row  column 
using  the  linear  dependencies).  For  u  g  let 


€  R(”»-l)(n-l) 


'^i,i  2  —  ^  2  ^  —  \  •  •  •  n 

i  =  1  •  •  •  m  —  1  J 


and 

j  =  u  C(u) 

Ug7(m-l)(n-l)  n^£: 

Now  we  wish  to  find  a  convex  body  K'  such  that  J  C  K'  and  Vol(/C')  is  not  too 
much  bigger  than  Vol(  K).  If  we  had  such  a  body  then  to  generate  contingency  table 
(from  near  uniform  distribution)  we  would  use  standard  techniques  to  generate  a 
near  uniform  point  from  K'  and  rejection  sample  until  we  find  a  point  in  J  and 
round  to  an  integral  vector. 

Define 

y.  .  =  q.  niin(r,-,  Cj) 

T  2max(n2  —  n,  —  m) ' 
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Lemma  9  The  coordinate  aligned  cube  of  (loo)  diameter  2msx{!^-n‘^S-m)  centered 
at  Y  is  contained  in  K. 

Proof;  It  suffices  to  check  inequalities  (1)  and  (2)  with  X  =  Y'^  and  inequalities 
(3)  and  (4)  with  X  =  F"  where  =  Fj  +  2^  and  YQ  =  - 

min(re,C7)  /_  rjCj  \ 

2max(n2— n,m2— m)  V  T  / 

1. 

f  j _ niin(r^,  Cj) _ \  ^  ~  '^m)  ,  Cj(?72  —  1) 

\  T  max(n^  -  n,  —  m)  y  “  T  w?  —  m 

<  Cj(m-l)  ^  Cj(m-l) 

~  m  m?-  —  m 

2. 

y  (ri^_^  min(n,cj)  \  ^ 

max(n2  —  n,m^  —  m)y  ~ 

< 

3. 

_  (T  -  rnt)(r  -  Cn) 

T 

m  ^  771  ^2 

=  T-rm-Cn  +  — 

>  T  V  jn  Cn- 


m— 1  77—1 

EE 

7=1  i-1 


7^i 


^i{T  Cyj,)  1) 

T  ^  n^-n 

rjjn  -  1)  n(n  -  1) 

n  v?-  —  n 

Ti. 


4. 


□ 


rifi 

T 


>0 


Corollary  2  ^min(ri,ci)  >  2X(n  —  l)(m  —  l)niax(n^  —  n^m?  —  m)  then  the 
dilation  K'  of  K  about  Y  by  contains  J  and  <  e*^^. 
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Proof:  If  min(ri,  Cl)  >  2Anmmax(n^  —  —  m)  then  by  the  last  lemma  we 

know  that  a  coordinate  aligned  cube  of  size  A(n  —  l)(m  —  1)  centered  at  Y  is 
contained  in  K.  Consider  any  c  €  and  b  real  such  that  c-  (x  —  Y)  < 

6  Vx  e  K.  Take  an  arbitrary  2;  such  that  c  ■  {z  —  Y)  <  b  and  notice  that 
C{z)  is  contained  in  the  half  space  c  •  [x  —  Y)  <  (1  +  l/(A(n  -  l)(m  -  1))). 
From  this  it  is  easy  to  see  that  K'  contains  J.  The  volume  of  K'  is  precisely 
(1  +  l/(A(n  —  l)(m  —  l)))(”“*hm-i)Vol(/T)  which  is  no  more  than  e*/^Vol(/T). 
□ 

Let  tl(/T)  denote  the  number  of  points  in  K  with  integer  coordinates. 
Corollary  3  .^min(ri,ci)  >  2Anmmax(n^  —  n,m?'  —  m)  then 
e-'/^Vol(/T)  <  i(/T)  <  e^/^Vol(/T). 

Proof:  The  right  inequality  follows  immediately  from  the  last  corollary.  Consider 
any  c  €  and  b  real  such  that  C'{x  —  Y)  <  6  Vx  G  K.  Take  an  arbitrary 

2;  such  that  c-{z  —  Y)  <  b  notice  that  (7(2)  is  in  the  complement  of  the  half  space 
c  •  (z  -  F)  <  (1  -  l/(A(n  -  l)(m  -  1))).  □ 


3.19  Counting  from  generation. 

The  corollaries  of  the  previous  section  extend  into  an  approximate  counting 
scheme,  if  we  had  a  method  of  approximating  the  ratio  of  the  number  of  ta¬ 
bles  with  row/column  totals  (r,  c)  (r,  >  nmmax(n^  —  n,m^  —  m),  Vj  > 
nm max(n^  —  m)  i  =  1  •  •  •  m,  j  =  1  •  •  •  n)  to  the  number  of  tables 

with  row/column  totals  (r',  c') 


,  J  Tj  +  A 

i  =  a 

otherwise 

J  -  1  ^3+^ 

j  =  b 

-  1  c. 

otherwise 

for  arbitrary  a,  b  and  some  non-negligible  A. 

Suppose  we  wish  to  count  to  within  a  relative  error  of  1  +  e  (0  <  e  <  1),  let 
=  1^ j  above  arguments  the  number  of  integer  tables 

is  between  e“'/^°Vol  and  e^/’°Vol.  This  means  we  certainly  have  the  volume 
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of  this  body  approximates  the  number  of  lattice  points  with  a  relative  error  of 
no  more  than  1  +  e/3,  volume  of  a  table  with  all  row  sums  and  column  sums 
>  N  approximates  the  number  of  lattice  points  to  within  a  relative  error  of  1  +  j. 
All  that  remains  would  be  to  use  standard  technique  (as  in  Lovasz/Simonovits 
or  Kannan/Lovasz/Simonovits)  to  calculate  the  volume  of  this  body  to  sufficient 
accuracy  and  approximate  the  ratios  of  lattice  points  as  outline  above. 

If  we  could  boost  all  and  Cj  up  to  at  least  A  in  a  polynomial  number  of 
stages  and  compute  the  ratios  of  all  these  stages  so  that  the  product  of  these  ratios 
has  a  relative  error  of  no  more  than  1  +  f  then  we  would  have  a  good  estimate 
of  the  number  of  tables  obeying  the  original  row  and  colunrn  sums.  We  will 
show  that  we  can  take  A  =  [min(  ^ ^  when  trying  to  boost  a  row  and  guarantee 
that  the  ratio  of  contingency  tables  obeying  the  original  row/colurrm  totals  to  the 
boosted  row/column  totals  is  in  the  range  >  !]•  This  is  enough  to 

allow  us  to  approximately  count  using  no  more  than  0  (nm  ln(max(/^,r,„))) 

(for  m  rows  double  each  row  in  about  n  stages,  raise  row  to  max(A,  r^)  in  about 
ln(max(N’rm))  doublings)  each  being  computed  to  a  relative  error  no  finer  than 

Q  ( - — ,  .■nax(N,r,„)-: )  yielding  an  approximation  scheme  that  runs  in 

time  polynomial  in  n,  m. 

boosting  a  table  To  compute  the  ratio  of  tables  obeying  (r,  c)  to  (r',  c')  (as 
defined  before)  we  identify  tables  X  obeying  (r,  c)  with  X'  (equals  X  with  Xa,b 
increased  by  A)  this  defines  1-1  map  from  tables  obeying  (r,  c)  to  a  subset  of  the 
tables  obeying  (r',  c').  So  we  generate  a  table  X'  from  the  uniform  distribution 
that  obeys  the  (r',  c')  and  check  if  we  are  in  the  subset  corresponding  to  the  (r,  c) 
tables  (ie.  is  A'  ^  >  A?).  All  that  remains  is  to  show  that  this  subset  of  all  (r',  c') 
tables  is  not  too  small. 

Theorem  10  If  X  is  uniform  random  variable  corresponding  to  a  table  obeying 
the  row/column  sums  (r,  c)  then  Xa,b  >  m)]  (ro-i)(n-i)+i 

time. 

Proof:  Let  A  =  min(  ^ ,  ^)  -  We  will  define  a  map  /  that  maps  all  tables  obeying 
(r,  c)  into  the  set  of  tables  X  obeying  (r,  c)  such  that  Xa,h  >  A.  If  Xafi  >  A  let 
f{X)  =  X  otherwise  by  the  pigeonhole  principle  we  see  that  there  exists  x,  y  such 
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that  Xa,y  >  [^J  and  X^,  .b  >  [^J-  Let  i,i  be  the  smallest  indices  obeying  these 
inequalities  (i  ^  a,j  ^  b).  And  define 


-t-A 

if  u  = 

a  and  v 

=  b 

+  A 

if  u  = 

X  and  V 

=  y 

-A 

if  u  = 

X  and  V 

=  b 

-A 

if  u  = 

a  and  v 

=  y 

^UyV 

otherwise 

most  ( 

m  — 

l)(n- 

1)  +  1  tables 

range.  □ 


a  trick  to  speed  it  up  We  can,  in  some  cases,  speed  up  the  reduction  by  avoiding 
the  volume  computation.  Assume  n  =  m  and  let  Ki  denote  the  polytope  corre¬ 
sponding  to  Vi  =  /,Cj  =  I  for  all  i,j.  Notice  the  polytope  Ki  is  just  IK\.  Also, 
Ki  has  only  integral  vertices  because  the  matrix  formed  by  the  left  hand  side  of 
the  constraints  given  in  the  definition  of  K  is  totally  unimodular.  Thus  the  theory 
of  lattice  point  enumerators  ([34]  pp.  135-140)  can  be  brought  in  and  we  see  that 
a  polynomial  pi  of  degree  exactly  (m  —  l)(n  —  1)  passes  through  the  integers 
ti(/T/)  /  =  1  •  •  •  oo.  So  if  we  knew  pi  (as  a  non-uniform  piece  of  information  in 
the  circuit  theory  sense  or  from  an  oracle)  we  would  not  have  to  boost  the  table 
until  all  rows  and  columns  were  above  N  but  only  until  they  were  equal  (in  fact  we 
could  even  decrease  them  all  be  be  min(r  i ,  ci )  allowing  the  more  natural  reduction 
direction:  towards  smaller  problems).  Now  by  a  brute  force  technique  (similar 
to  the  one  developed  by  the  authors  of  STATEXACT  [42])  we  have  explicitly 
computed  p/  for  2  x  2, 3  x  3, 4  x  4,  5  x  5  and  6x6  tables  (which  can  be  used  to 
help  enumerate  any  tables  with  max(n,  m)  <  6. 


3.20  Difficulty  in  generating  Fisher  Yates. 

Generating  Fisher- Yates  distributed  tables  in  time  polynomial  in  log{T)  (instead 
of  polynomial  in  T)  seems  to  be  more  difficult. 

The  main  problem  is  that  the  density  function  varies  quite  rapidly.  It  is  easy 
to  show  the  density  function  is  log-concave  (replace  a:!  with  r(a;  -|- 1)  everywhere 
and  notice  that  is  non-negative  for  a;  >  0  [1]  6.4.10).  But  if  a 

step  is  taken  such  that  an  entry  is  increased  from  0  to  d  the  density  function  can 
vary  by  as  much  as  a  multiplicative  factor  of  d. 
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In  fact  the  density  function  can  go  through  an  incredible  range.  Consider  a 
n  by  n  table  with  row/column  suras  all  equal  to  T/n  (assume  divides  T)  and 
consider  two  fill-ins  E  where  Eij  =  Tjn^  and  U  where  Ui^i  =  Tln  and  Uij  =  0 
for  i  /  j. 


/Ti\” 

D{E)  ^  (?0 

D(U)  (r,)”- 

n'^+n/2 

=  (V2r)”“-" 

which  grows  as  for  large  enough  fixed  n  and  growing  T. 

3.21  Local  geometry  of  Fisher  Yates. 

3.21.1  Bounds  on  variation 

Consider  m  row  by  n  column  contingency  tables,  with  row  sum  vector  r,  column 
sum  vector  c  and  non-negativity.  Take  the  function  F  defined  over  all  such  tables 
X  such  that: 


F(X)  =  5;inr(Xij  +  l)  (3.29) 

hi 

We  note  that  for  X  >  0  that  F  is  strictly  convex  [32,  8.363  8]. 

Lemma  10  If  X  is  a  legal  table  and  X  >  0  then  for  any  A  such  that  AT  +  A  w  a 
legal  table  and  X  +  A  >  0 

F{X  +  A)  -  F(X)  <  E  +  E  +  1)  (3.30) 

hi  i,j 
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Furthermore,  if  for  alii,  j  AijKXij  +  1)  >  — 3/4  then 


F{X  +  A)  -  F{X)  > 


E 


A? 


5Xij  +  3 


+  +  !)• 

hi 


(3.31) 


Proof:  First  define: 


fit)  =  Fix  +  tA)  =  Elnr(X;j  +  1  +  tAij) 

hi 


then  by  [32,  8.360] 


hi 


and  [32,  8.363  8] 


Also 


/"(()  =  Y.  +  1  +  iAij) 

hj  • 

OO  J 

i,j  S  1 


/'(<)  =  /(o)  +  jf /"(y)d!/ 


(3.32) 


(3.33) 


To  prove  inequality  3.30  assume  that  A  >  0  and  26  +  A  >  0.  Continuing  from 
equation  3.32 


/"(<)  <  E  ^4° 


dy 


hi  : 

=  E 


(26,  j  +  tAjj  +  y)2 


A?. 


i,j  :  A, ,>5^0  ^hi  + 
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Combining  with  equation  3.33 


A?. 

h3 


hi 


*4“  y^z,. 


dy  +  +  1) 


A2j(ln(X^j  +  ^^,i) 

* J  • 

hi 


So  it  is  enough  to  get  an  upper  bound  on: 


ij  ■  Ai,i#0 

=  53  (("^*.i + ^,j)  +  ^'j)  “ 

«J  :  Ai,j^0 

(Xjj  +  Aij)  -  Aij  In(Xij)  -  Xij  In(Xij)  +  Xij) 

=  Y1  +  ^.i)  - 

=  53 


(the  free  A^j  disappears  from  that  second  to  last  line  because  Y^ij  ■  A,,j  =  0). 
Continuing  with  [1, 4.133],  ln(l  +  a;)  <  a:  for  a;  >  — 1: 


< 


-*2, a 


hi  ■  Ai.j5i0 


X- 


hJ 


E  (^^  +  1^) 


hJ  -  ^ij 

A? ' 

E  ^ 

« J  :  Ai,j#0  ^*>a 


To  prove  inequality  3.31  further  assume  that  Aijf{Xij  +  1)  ^ 
from  equation  3.32 


fit) 


dy 


hi  :  A<,j5^0 


{Xi^j  +  1  +  tAjj  +  yY 


/4.  Again 
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E 


A?. 

hJ 


i,j  : 

Combining  with  equation  3.33 


Xi  j  +  1  + 


/'(<)  >  /'  E 


A? 


■dy 


^,3  : 1 

=  13  ^.i(M^«.i  +  1  +  ^Aij)  -  ln(Xj,j  +  1)). 

h3  ■ 

Now  it  is  enough  to  get  a  lower  bound  on: 

/'  (ln(A:,j  +  l+(A-,i)-ln(Ari,,  +  l))Ai^di 

•Jo  .  •  .A  -zrt 

=  53  +  1  +  Aij)  ln(Xj'j  +  1  +  Ai,j)  — 

(Xjj  +  1  +  Ajj)  -  Ajj  ln(Xij  +  1)  - 

(Xij  +  1)  ln(Xij  +  1)  +  Xi^j  +  1) 

=  ^  (Xij  +  1  +  A;j)(ln(Xij  +  1  +  Ajj)  —  ln(Xij  +  1)) 

'  ^i,j¥^0 


=  XI  +  1  +  ^‘j)  In  ( 1  + 


‘  Ai  0-7^0 


Xij  +  1  ^ 


(the  free  Aij  disappears  from  that  second  to  last  line  because  :  Aij^o  —  0)- 
Using ln(l  +  x)  >  x/{l  +  |x)  fora:  >  —3/4. 


>  Y1  (^i.i  +  1+Ai^) 

=  53  1 


'^t.7  + 1 


1  + 


2  ,j 

Xi,,+1 


-X'ij'+l 


=  E  ^ 

hi  '■  AijT^O 

=  53  I  ^  + 

*ij  :  Ai,i#0 


Xi  j  +  1  +  A; 


-^i,7  +  1  +  Aj  j  —  ^ 


3 


Xij  +  1  +  Aj, 


A. 
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=  E 

« J  :  Ai,j 

>  E 


A2  . 
_bi. 
3 


i,j  :  1 

A?. 


^■.7 

3 


5Xij  +  3 


□ 


ij  :  Ai,,-540 


We  have  the  following  theorem: 


Theorem  11  IfX  is  a  legal  table  minimizing  equation  3.29  (over  all  legal  tables) 
and  X  >  0  then  for  any  A  such  that  X  -\-  JS.  is  a  legal  table  and  X  +  A  >  0 

F{X  +  A)  -  F{X)  <  E  ^  (3.34) 

hi  ^*4 

Furthermore,  if  for  alii.,  j  A;j/(X,j  +  1)  >  —314  then 

F{X+A)-F(X)>Y^  (3.35) 


Proof:  It  is  clear  that  J2i,j  +  1)  =  0  in  this  case  and  we  get  the  result 

from  lemma  10.  □ 


3.21.2  Location  of  the  minimum 

Let  X  be  a  m  by  n  non-negative  matrix  obeying  row  sums  r,  column  sums  c  and 
T  =  YhL\  f'i  =  Tfj=\  Cj-  Define  a  m  by  n  matrix,  S{X),  as  follows: 


X 


h3 


>^  +  l 

X,j  <  ^  -  1 


+l 

-1 

0  otherwise 


(3.36) 


We  call  an  m  by  n  matrix  A  a  “non-trivial  balanced  sub  marking”  of  S{X)  if: 
•  3*  G  [1  •  •  •  m]  3j  G  [1  •  •  •  n]  s.t.  Ajj  f  0  (“non-trivial”) 

^  ViG[l---m]  Ei=iA,-„  =  0 


VjG  [!•••«]  Er=iAij  =  0 


(“balanced”) 
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•  V*  e  [1  •  •  •  m]  Vj  e  [1  •  •  •  n]  (Aij  =  S(X)ij)V(Aij  =  0)  (“sub-marking”). 

Lemma  11  Tbfe  X  is  a  m  by  n  non-negative  matrix  obeying  row/column  sums 
r,c,  T  =  Yli'f'i  =  ^  ^  non-trivial  balanced  sub  marking  of  S{X)  and 

Vi,  j  ^  >  1.  Then 

F{X)  >  F{X  -  A). 


Proof:  Let  aj  =  Xij  —  -|-  Ajj).  Notice  that  (e,,j  <  0)  — ^  (Ajj  <  0)  and 

(e;,j  >  0)  ->  (A;j  >  0).  Then 


F{X)-F{X-A)  = 


> 

> 


E(lnr(^  +  A,j  +  e,,  +  l) 

hi  '  ^  ^ 

lnr(^+£y  +  l)) 

E  in  +  l)  - 

E  ln(^  +  Q.) 

.  E  '■'(^)-  E  to 
E  to(!f)-  E  to 

hj  •  i^j  :  At^j<0 

0. 


□ 


Lemma  12  If  S{X)  is  such  that  every  row  and  every  column  has  at  least  one  -|-1 
and  at  least  one  —  1  then  S{X)  has  a  non-trivial  balanced  sub  marking. 


Proof:  We  can  assume  (without  loss  ofgenerality)  that  5” (X)  1,1  =  +l,5(X)i,2  = 
—  1  andS'(X)2,i  =  —1.  This  gets  us  into  the  general  case  used  in  this  proof  where 
we  have  A^  the  m  by  n  matrix  such  that 


\  +1  (i,i)  =  (l,l) 

^  (_l)min(i,i)  (1  <i<fc)A(l  <i<A:)A(i=  j±l) 
0  otherwise 
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and  ((A/c)ij  7^  0)  ((Afc)tj  —  S{X)ij),  We  are  going  to  use  an  inductive 
argument  on  k.  One  such  pattern  (n  =  m  =  5,  fc  =  4)  is: 

■  +1  -1  0  0  0  ’ 

-1  0+1  0  0 

0+1  0-10. 

0  0-1  0  0 

_  0  0  0  0  0 

Now  if  we  know,  as  above,  the  initial  A:  by  A:  segment  or  S{X)  we  know  (by 
the  given)  there  must  be  a  (-1)^  somewhere  in  column  k.  If  S{X)k,k  =  (- 
then  we  see  that  if  (-1)*'  is  added  to  the  {k,k)  position  of  Ak  then  we  have 
the  desired  non-trivial  balanced  sub  marking.  If  S{X)i^k  =  {—  Y  for  some 
i  <  k  -  \  then  we  see  that  if  we  add  (-I)*'  to  the  (i,  k)  position  of  A*,,  zero 
out  the  column  a  <  k  such  that  =  (-1)^  and  then  iteratively  zero  out  all 

columns  that  have  the  only  non-zero  entry  for  any  row  then  what  we  have  left  is 
again  the  desired  non-trivial  balanced  sub  marking.  An  example  of  this  process  is 
(n  =  m  =  5,  A;  =  5,  S'(A)i,5  =  —1): 

■  +1  0  0  0  -1  ' 

-10+10  0 
0  0  0  0  0  . 

0  0-10+1 
0  0  0  0  0 

So  either  k  =  m  and  we  must  have  a  non-trivial  balanced  sub  marking  or  we 
can  assume  (by  reordering  rows)  that  S{X)k+i,k  =  (-!)*'•  By  a  similar  column 
argument  we  see  that  either  we  have  the  desired  marking  or  (by  reordering  columns) 
S'(X)fc,fc+i  =  (-1)*'.  Thus  we  can  say  that  either  we  are  done  or  A^+i  is  such  that 
((Afc+i)jj  ^  0)  — ^  ((Afc+i)jj  =  S{X)ij).  □ 

Theorem  12  If  X  is  the  m  by  n  non-negative  matrix  obeying  row/column  sums 
r,  c  minimizing  equation  3.29  over  all  such  tables  and  T  =  YT=\  ~  ^i=i 
then  for  all  i^j: 


\Xi,j  -  ^1  <  (m  +  n  -  3)max(m  -  l,n  -  1). 
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Proof:  We  proceed  by  induction  on  m  and  n.  If  m  or  n  is  less  than  two  then 
the  theorem  is  trivial.  For  m  =  n  =  2  the  theorem  follows  immediately  from 
lemmall.  Assume  the  theorem  is  true  for  all  a,  6  s.t.  ((a  <  m)V(6<  n))A(l  < 
a  <  m)  A  (I  <  b  <  n).  Suppose  every  row  and  every  column  of  X  has  an  entry 
differing  from  by  at  least  max(m  —  1 ,  n  —  1 ) .  Then  S (X )  must  satisfy  the 
conditions  of  lemma  12  allowing  us  to  again  apply  lemma  11.  So  we  assume  that 
there  is  some  row  or  column  where  every  entry  is  within  max(m  —  1 ,  n  —  1 )  of 
for  discussion  assume  it  is  row  m.  Notice  that  the  m  —  1  by  n  initial  submatrix  of 
X  is  the  unique  matrix  minimizing  equation  3.29  over  all  non-negative  m  —  Iby 
n  matrices  satisfying  row  sums  r  and  colunm  sums  Cj  —  Xmj.  By  our  inductive 
hypothesis  we  know  that  \Xij  —  I  ^  niax(m  —  2,  n  —  l)(m  +  n  —  4) 

for  i  =  1  •  •  •  m  —  1.  We  quickly  see  that  for  i  <  m  we  have  \Xij  —  < 

{t^  +  m  +  n  -  4)  max(m  -  1,  n  -  1).  And  for  i  <mv4Q  have  <1.0 
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Chapter  4 

Open  Problems 


We  would  like  to  identify  some  important  open  problems  in  the  areas  touched  on 
in  this  thesis.  For  the  Markov  chains  (used  in  both  the  optimization  chapters  and 
contingency  table  chapters  of  the  thesis)  a  number  of  important  questions  remain 
open. 


4.1  Effective  diameter. 

Most  of  the  current  upper  bounds  on  the  mixing  time  for  Markov  chains  with 
average  step  size  5  whose  states  are  contained  in  a  convex  body  of  diameter  d 
involve  a  factor  of  {djSY.  It  would  be  a  big  breakthrough  to  be  able  to  measure 
both  d  and  5  in  the  infinity  (or  max)  metric  without  introducing  factors  of  n.  It 
would  also  be  a  big  improvement  to  be  able  to  use  the  “effective  diameter”  or  the 
expected  distance  between  two  points  in  K  instead  of  d  which  is  the  maximum 
distance  between  any  two  points  in  K.  Any  of  these  improvements  would  lead  to 
much  better  bounds  for  mixing  time. 


4.2  Non-stationary  processes. 

A  method  to  analyze  non-stationary  processes  would  be  very  useful.  This  would 
not  only  allow  us  to  directly  prove  mixing  rates  for  simulate  annealing  but  would 
allow  the  development  of  adaptive  algorithms.  An  adaptive  algorithm  would  start 
with  a  simple  Markov  chain  that  could  have  a  poor  mixing  rate  due  to  states  with 
low  local  conductance.  Such  a  chain  could  be  repaired  when  detected  by  forcing 
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smaller  steps  near  the  bad  states  encountered.  If  good  time  bounds  could  be 
developed  for  such  a  scheme  then  one  would  only  have  to  run  the  Markov  chain 
for  a  time  proportional  to  the  rate  bad  states  are  actually  encountered  instead  of 
proportional  to  a  weak  upper  bound  on  this  rate. 


4.3  Convergence  acceleration. 

The  linear  operator  interpretation  of  Markov  chains  opens  the  question  if  any  of  the 
many  methods  of  convergence  acceleration  methods  used  to  accelerate  iterative 
linear  systems  can  be  applied  to  Markov  chains.  I  do  not  know  if  this  is  possible 
but  have  thought  about  applying  Chebyshev  acceleration[35]  to  a  Markov  chain. 

The  idea  in  Chebyshev  acceleration  is  that  if  one  has  a  process  Xi+i  = 
{XiP)/{\\XiP\\^)  that  starts  with  Xo  and  P  has  a  imique  eigenvector,  tt,  with 
(largest)  eigenvalue  1  such  that  tt  =  lim<_),oo  Xt.  Then  if  one  applies  the  operator  t 
timestogetthevectorsA’o,Xi,-  •  -XfthenXf  is«otthebestestimatefor;r.  There 
is  a  estimate  tt  =  where  the  A;  are  explicit  constants  independent  of  Xi 

and  P  that  is  a  better  estimate  of  tt  than  Xt  is.  Not  all  the  A^  are  non-negative. 

The  problem  for  Markov  chains  is  that  we  realize  Xi  as  a  probability  vector,  so 
the  subtractions  needed  to  compute  ^  have  no  natural  interpretation  in  this  context. 
In  addition  the  Markov  chain  is  only  able  to  estimate  Xi-  so  if  the  Xi  are  large  (and 
they  are)  then  errors  in  estimating  Xi  soon  make  tt  unusable.* 


4.4  Unary  polynomial  time  in  row/column  sums. 

For  the  contingency  table  problem  it  would  be  nice  to  develop  chains  that  ran 
in  unary  polynomial  time  in  T,  that  total  of  all  rows  and  columns.  The  current 
inability  to  do  this  represents  a  major  weakness  in  current  uniform  generation 
techniques  for  contingency  tables.  A  unary  polynomial  time  algorithm  to  generate 
contingency  tables  from  the  uniform  distribution  would  be  useful  for  two  simple 
reasons.  First,  even  though  T  can  be  vary  large  in  principle  it  is  typically  a 
number  that  some  statistician  has  counted  up  to.  In  practice  many  contingency 
tables  are  constructed  as  summary  statistics  of  lists  of  people.  So  even  though  it 

•chebyshev  acceleration  seems  to  improve  bias  at  the  expense  of  amplifying  variance.  In 
explicit  eigenvector  problems  the  only  source  of  variance  is  rounding  error-  so  this  is  a  good  trade. 
In  random  processes  the  variance  is  large  to  begin  with. 
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takes  no  more  that  nm  log(T)  space  to  write  down  such  a  table  it  often  took  the 
statistician  time  proportional  to  T  to  generate  it.  Thus  an  algorithm  that  runs  in 
time  polynomial  in  T  would  be  very  useful  to  a  statistician,  though  the  computer 
scientist  would  of  course  like  to  see  one  that  runs  in  time  polynomial  in  log(r). 
In  this  same  vein  there  is  an  obvious  and  efficient  method  to  generate  tables  from 
the  Fisher- Yates  or  hypergeometric  distributions  in  time  polynomial  in  T.  The 
method  is  just  to  build  a  large  complete  bipartite  graph  Avith  T  left  nodes  and  T 
right  nodes.  One  labels  each  right  node  with  an  index  from  1  •  •  •  m  such  that 
right  nodes  have  index  i.  One  labels  each  left  node  with  an  index  from  1  •  •  •  n  such 
that  Cj  right  nodes  have  index  j.  Then  one  chooses  at  random,  from  the  uniform 
distribution,  a  perfect  matching  of  the  left  to  right  nodes.  The  matrix  X  such  that 
Xi^j  is  the  number  of  right  nodes  marked  with  i  that  are  matched  up  with  left  nodes 
marked  with  j  is  then  a  contingency  table  consistent  with  (r,  c)  and  is  Fisher- Yates 
distributed.  It  would  be  nice  to  have  a  comparable  algorithm  for  the  uniform  case. 


4.5  Higher  way  contingency  tables. 

Contingency  tables  representing  relationships  between  more  the  two  variables 
have  been  proposed.  There  is  some  freedom  in  defining  what  constraints  are  in 
three  (and  higher)  way  tables.  If  we  define  fc— way  contingency  tables  to  be  k 
dimensional  block  of  non-negative  integers  constrained  such  that 

H 


where  the  constants.  Then  there  is  no  known  scheme  for 

approximately  counting  the  number  of  constrained  3 -way  tables  as  this  would  be 
sufficient  to  approximately  compute  the  general  permanent.^  We  show  this  by 
encoding  the  set  of  perfect  matchings  of  an  arbitrary  bipartite  graph  in  a  3 -way 
table.  Let  G  be  a  bipartite  graph  with  vertex  sets  V\ ,  Vi  and  edges  sqXE  CVyxVi. 
We  define  X,  a  3  way  table  with  dimensions  IVj  |  x  IV2I  x  2,  and  set 

c+j,i  =  1 

—  1 

^Sinclair  and  Jerrum  settled  the  question  of  computing  the  dense  permanent,  the  general  case 
remains  open. 
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^  i  1  (^i)  e  E 

y  0  otherwise 

C*  J,2  =  ^  C,  J,*  —  1 

i 

Cj,*,2  =  ~  ^ 

3 

Then  for  any  X  obeying  these  constraints  we  see  that  the  set  of  (i,j)  such  that 
Xij^i  0  is  a  perfect  matching  of  G  and  all  perfect  matchings  of  G  can  be  so 
encoded. 

Similarly,  a  4-way  table  has  enough  structure  to  encode  a  three  dimensional 
matching,  so  the  decision  problem  itself  is  NP  complete. 

4.5.1  Simpson’s  paradox 

One  might  wonder  if  there  is  any  actual  necessity  for  higher  way  tables.  Perhaps 
one  could  collapse  the  information  in  a  3  way  table  into  a  two  way  table  with  a  com¬ 
parable  number  of  cells  and  still  perform  a  meaningful  analysis.  We  present  here 
an  example,  exploiting  the  well  known  Simpson’s  paradox,  that  should  illustrate 
that  this  is  not  the  case. 

Consider  the  3  way  table  in  table  4.1,  which  could  arise  from  a  drug  trial  run 
in  two  cities. 


City  A  City  B 


Drug 

67 

33 

Drug 

550 

450 

Control 

590 

410 

Control 

50 

50 

Good  Bad  Good  Bad 


Table  4.1 :  Drug  trial  in  two  cities. 


This  data  can  obviously  be  organized  into  a  3  way  table  indexed  by  treatment 
(drug/control),  result  (good/bad)  and  city  (A/B).  The  obvious  way  to  pair  this  data 
into  a  two  dimensional  table  would  be  to  add  the  city  A  data  to  the  city  B  data  to 
form  a  summary  table  like  table  4.2. 

The  problem  is  that  there  is  no  way  to  tell  if  important  information  has  been 
thrown  away  in  this  summarizing  step.  In  table  4. 1  the  drug  has  a  lower  rate  of  bad 
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effects  than  control  in  both  city  A  and  city  B.  However  in  table  4.2  we  see  that  drug 
has  a  higher  rate  of  bad  effects  than  control  over  all.  It  is  impossible  to  determine 
from  only  examining  the  data  if  summarizing  table  4. 1  into  table  4.2  was  legitimate. 
If  one  believes  that  the  populations  in  city  A  and  city  B  are  identical  and  that  the 
selection  process  that  divide  people  into  drug  and  control  trials  was  independent 
of  any  properties  of  the  city  populations  (like  only  city  B  got  funding  for  a  big 
drug  trial)  then  one  might  want  to  use  the  summary  data.  If  one  does  not  believe 
that  the  cities  have  identical  populations  and  an  indifferent  selection  mechanism 
then  one  can  not  draw  any  conclusions  because  for  all  practical  purposes  city  A 
ran  only  a  control  group  and  city  B  ran  only  a  drug  group. 
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Cities  A  +  B 
Drug 
Control 


617 

483 

640 

460 

Good  Bad 


Table  4.2:  Drug  trial  summary. 
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Appendix  A 
Notation 


Notation  used  in  this  thesis. 

Br{x) 

Cr{x) 

C(u) 

QTf{x) 

T{x) 

Lin(a:i,---,a;„) 

T{v\i'  '  '  Vyn ,  Cj ,  •  ■  •  C,j) 

Qn 

Q”+ 

R” 

R"+ 

m 

tt(r;c) 

<  X\,---,Xn  > 

\x\ 


The  ball  of  radius  r  centered  at  x. 

The  cube  of  side  2r  centered  at  x. 

Shorthand  for  Ci/2{x). 

Kronecker  delta,  1  if  i  =  j,  0  otherwise. 

The  i’th  standard  unit  vector,  Ej  =  Sij. 

Error  function. 

Euler  gamma  function. 

Linearity  space  of  x\,  -  •  •  ,Xn. 

The  set  of  all  m  by  n  non-negative  integer  matrices 
with  rows  sums  ri ,  •  •  •  and  column  sums  ci ,  •  •  • ,  c„ 
Rational  n  space. 

The  set  of  a;  G  Q"  such  that  a;  >  0. 

Euclidean  n  space. 

The  set  of  a;  G  R”  such  that  x  >0. 

The  standard  integer  lattice  in  n  space. 

The  number  of  elements  in  the  set  X. 

Shorthand  for  tl(T(r;  c)). 

Group  generated  by  aji ,  •  •  • ,  a;„. 

The  number  of  elements  in  the  set  X  or  length  of  X 
(depending  if  A  is  a  set,  scalar  or  vector). 
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Appendix  B 


Complete  3x3  solution 


The  complete  enumeration  oracle  for  3  by  3  contingency  tables  is  used  as  follows. 
One  starts  with  row  sums  [xi,  X2, 0:3]  and  column  sums  [0:4,  X5,  X1+X2+X3-X4-X5] 
and  arranges  them  (by  row  swaps,  columns  swaps  and  transpose)  to  obey  all  of 
the  relations  in  table  B.l  (the  table  margins  are  sorted  so  the  row/column  demands 
are  non-decreasing  and  the  first  row  demand  in  no  bigger  than  the  first  column 
demand). 


[ 

1 

0 

0 

-1 

0 

]• 

‘  X 

< 

0 

[ 

-1 

-1 

-1 

1 

2 

]■ 

■  X 

< 

0 

[ 

0 

0 

0 

1 

-1 

]■ 

■  X 

< 

0 

[ 

0 

1 

-1 

0 

0 

] 

*  X 

< 

0 

[ 

1 

-1 

0 

0 

0 

] 

•  X 

< 

0 

Table  B.l;  Global  relations  for  3  by  3  contingency  tables. 

Then  one  navigates  down  the  decision  tree,  figure  B.l,  moving  down  a  dashed 
arc  when  an  inequality  from  table  B.2  is  violated  and  dovm  a  solid  arc  otherwise. 

The  chamber  labeled  by  the  leaf  reached  then  has  the  correct  polynomial  and 
the  values  xi,  •  •  • ,  0:5  are  substituted  into  the  polynomial  to  get  the  desired  count. 
A  complete  list  of  the  3  by  3  polynomials  follows. 
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chamber  12 


(nlO)rel6 - -  (nll)rel2 


chamberl  1 


chamber  10 


chamber9 


(n6)rel7 - ^  (n7)rel0 


(n9)rel2 - chambers 


(n8)rel2 - chamber7 


(n0)rel3 


chamber6 


(nl)relO - ^  (n4)rel4 


chambers 


(n5)rel5 


(n2)rel5 - chamber2 


(n3)rell 


chamber4 


chamber3 


chamberl 


chamberO 


Figure  B.l :  Decision  tree  for  3  by  3  contingency  tables. 
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relO  : 

[  0 

0 

1 

-1 

-1 

]• 

•  X 

< 

0 

rell  : 

[  0 

1 

0 

-1 

-1 

] 

•  X 

< 

0 

rel2  : 

[  0 

1 

0 

-1 

0 

] 

‘  X 

< 

0 

rel3  : 

[  0 

1 

0 

0 

-1 

] 

•  X 

< 

0 

rel4  : 

[  1 

0 

1 

-1 

-1 

] 

*  X 

< 

0 

rel5  : 

[  1 

1 

0 

-1 

-1 

] 

■  X 

< 

0 

rel6  : 

[  1 

1 

0 

-1 

0 

] 

•  X 

< 

0 

rel7  : 

[  1 

1 

0 

0 

-1 

] 

•  X 

< 

0 

Table  B.2:  Decision  tree  inequalities  for  3  by  3  contingency  tables. 


chamber  0:  —^xjxi  —^xjxs  +^x\x/^xs  —\x\  +\x\_x^x$  —^x\  +\x\Xn 

-\-\x\Xi  -\-X^Xs  -\-\x\  +X4  +Xi  +1 


4~Z~D  ‘  '  2  i-  4*  j  '  1  ^ — '  3  ■  24  4  -4  - J 

-^X4xl  -^x\-'^x\  +\x\X4Xs  ^\x\-'^x\x4  -\x\xs  -\-\x2x\^-^X2X4Xs  +1x2x1 
4^4  4^4®5  4®4^5  4^5  ”I”6®1®4  +'^XiX^  24^2  ”t”j2®2^4  +'^X2X$ 

^®4  ”t”  j2®4®5  ^^5  +4^1  +4X2  “1”42^4  ”I”4^5  H”1 

chamber  2:  —^x\  -lxjx2  +^a;ia:;4  +|a;ia:5  —^xjxl  +\x]x2X4  +jx\x2Xs  —\x\x4 
-)^x\x\  -|a;^  -\x\x2  +lxjx4  +lx]xs  -^Xixj  +\xiX2X4  +'^XiX2Xs  -\xixl 

Ixixj  -^x\  +^X\X2  +^XiX4  +%X\Xs  -^X^  +X2X4  +X2X5  -^xj  -^X^  +Xi 

Ixo  +hx4  +^a;4  +1 


2^5  “l”g2!i3J5  24^2  +12^^^^  “t”  j2®2^5  24^3  H“'j2®3^5 

-%xl -\x4Xs  -%x\  +\xi  +\x2  +Ja:3  +52:4  +|a;5  +1 
chamber  4:  —^x\  —\x\x2  +\x\x4  +\x\xs  —\x\x2  +\x\x2X4  +^x\x2Xs  —\x\x\ 
-\x\x]  -^x\  +\xlx4  +1x1x5  -\xlxl  -\xlx4X5  -\xlx\  +lx2x\  +ia;3a;^a;5 
+  ^X2X4x\  +^a;3a;5  -^x\  -\xlx5  -\xlx\  -\x4x\  -^a;^  -\x\  -\x\x2  +|a;ia;4 
+\x\x5  -ixixl  +|a;ia;2a;4  +l;XiX2X5  -|a;ia;^  -ixix}  +ia;^  -\xlx4  -|a;^a:5 


1  2. 

.-T24^l4t/5  ""2^2  "l"^2^4 

+\x2  +Jx3  +ia:4  +ia;5  +1 


chamber  5:  —^x\  —  jXia;2  —\x\x2  +^a;ia;4  +\x\x5  —^x\x2  +^a;ia;2X4  +jx\x2X5 
-\x\xl  +\x\x2X4  +\x\x2X5  -^xjxj  -^xjx4X5  -\x\x\  -\x\  -\x\x2  -|a:ia:3 
+|a;ia:4  +|a:jX5  —\x\X2  +|a;iX2a;4  +|a;iX2a:5  — |a:ia:3  +|a;ia;3a;4  +|a;ia;3a;5 
~x\xl  — |xia;4a;5  — |a;ia;5  —^x\  +^a;ia;2  +^XiX2  +X\X4  +3:1X5  —5X2  +X2X4 
-l-.riTc  — i.r?  A-t.iTa  -i-xix^:  —xi  —xaXc  —x^  ++i;i  -i-lx^  +xa;'>  +1 
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chamber  7:  —^x\x2  -\-\x\xa  +^a:ja;5  —^x\x\  +^x\x2X/i  -\-^x\x2Xs  ~x\~^ 

—  \x\x\  —\x\X2  +\x\X2X/[  -\-\xiX2Xs  —\x\X2x\  —^X\X2x\  +lxixl  -]r\x\X^^ 
1  „4  ,  1„3™  1  1^3^.  U2^2  \J2J1  I  i^-^3  i  1^.^3  _  1  ^4  _  1  ^4  _i^3 


-hA  +\A^4  +zA^5  -\xlA  -\AA  +\^iA  +z^2X5  -^xj  -^xj  -|a:i 

—  4.r?.'rT  -\-\x1;xa  A-\x?:x^  —\x:xl  A-^x^x-^Xa  -\-xXiX')X‘;  —^x^x^  _3~  ~2 


-lx\x2  +|a:ia;4  +lx]x5  -|a;ia;^  +|a:ia;2a;4  +|a;iX2a;5  -^a:ia;|  ^  ^  - 

+  |a;|a;4  +|a:|a;5  -|a;2a:4  -13:23:5  -\-\x\  +^3;^  +^2:13:2  -\-^X\Xa  +||3:ia:5 

~hA  +11^23:4  +153:2X5  -^xj  -^x\  +3:1  +X2  +5X4  +^3:5  +1 
chamber  8:  -^x|  -\x\x2  +^x^3;4  +|x^a:5  -\x\x\  +|a:i3:2X4  +5xfa:2X5  -\x\x\ 
-\x\xl  -\xxxl  ^\xxx\xs  -\xxX2x\  +|xi3:^  -^4  +5X3X5  -Jx^x^  +|x23:| 
-hx\  +5x^x4  +^x|a;5  -\x^x\  -i3;|3:4a;5  -^xjxj  +5X33:^  +^X3xjx5  +^0:33:40;^ 
+1x33:1  -jjxl  -Ixlxs  -jxjxj  -^3:4x1  -^xj  -\x]  -1x^X2  +|x?3;4  +Jx?3;5 
-|a;i3:^  +|a;iX23:4  +§3:13:23:5  -|xia:^  -|xi3:|  -\xl  +\x\xs  -|x2X^  +^3;^ 
-'^xlx^  -\xlxs  +13:33;^  +|a;3a;4a:5  +\x2x\  -\xl  -^xlxs  -7X43:^ 

+|xi3:2  +||xi3:4  +]^xxX5  -fe  +3:33:4  +§§3:2X5  -§§3;^  +§5X3x4  +§5X3X5  -5|x| 
-§5X4X5  -§5xf  +Xi  +|x2  +|x3  +§X4  +1 

chamber  9:  — -^Xj  — §XjX2  +§XiX4  +§XiX5  — ^xfx^  +5X1X2X4  +5X1X2X5  —7X1X4 

-\x\x\  -5X1X2  +5X1X2X4  H-5X1X2X5  -§XiX2X4  — 5X1X2X5  +§XiX4  +7X1X5 
-H®2  +5X^X4  +§X§X5  -§X§X^  -§:X§X^  +7X2X^  +§X2X^  -+X^  +5X^X4  +|x^X5 
7X2X4  5X2X4X5  ^XjXj  +gX3X4  ■|■5X3X4X5  +5X3X4XJ  +gX3X5  12^4  5X4X5 

-7X4X1  -§X4X^  —kxs  -5X§  -|x§X2  +|x§X4  +|x§X5  -|xix§  +5X1X2X4 
+  |XiX2X5  -|xiX^  -fxiX^  -§X§  +|x§X4  +|x§X5  -|x2x5  -|x2X^  +§:X^  -7X^X4 
-|x3X5  +|x3X^  +IX3X4X5  +1X3X5  -5X^X5  -|X4X^  +-^XiXt  +|IxiX4 


12^1^2  +n^l^4 

+  14x1X5  -^X§  +§5X2X4  +§5X2X5  -5§x^  +§5X3X4  +§5X3X5  -§5X^  -§5X4X5 


—  I5X5  +Xi  +X2  +5X3  +1 
chamber  10:  — ^xf  — §XiX2  +§XiX4  +5X1X2X4  -5X1X4  — 5X1  +5X1X4  +|xiX2X4 
-|xiX^  +^x§  +^XiX2  +§5X1X4  +X2X4  -5X4  +5X1  +X2  +5X4  +1 

chamber  11:  — ^Xj  —5X1X2  +5X1X4  +5X1X2X4  —5X1X4  —5X1X2  +5X1X2X4 

-§XiX2x5  +§XiX^  “^^2  +§X§X4  -Jx^x^  +§X2x|  -^x\  -|x§  +5X^X4 

+  5X1X2X4  -5X1X4  -5X§  +5X^X4  -5X2X^  +5X^  +^*i  +5X1X2  +§5X1X4  +^X§ 
+  §§X2X4  -55X^  +5X1  +5X2  +5X4  +1 

chamber  12:  5X1X2  +|xiX2  +5X1X2  +5X1  +5X1X2  +5X2  +5X1  +5X2  +1 
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